Projections using Hypertuned model through XGboost

All data is from FanGraphs. I have no affiliation with FanGraphs, but please consider contributing to their website if you found this project informative.

1 Project Scope

1.1 Objective

This project is designed to showcase how Using a Percentile Based Worth System values Fantasy Baseball Players through a Inning Pitched (IP) weighted projection

The Categories used for prediction valuation are year-end rankings for the following metrics:

  • Wins
  • Saves
  • Strike Outs
  • ERA ( 9 * Earned Runs per Inning Pitched)
  • WHIP (Walks and Hits per Inning Pitched)



2 Processing the Data

2.1 Getting Data Into R

2.1.1 Load Libraries

First we need to load the packages that R needs to run the analysis

library(sqldf) #SQL in R
library(skimr) #Summaries and useful for removing low % data
library(ggplot2) #Plotting Functions
library(plyr) #slightly deprecated data cleaning
library(dplyr) #slightly updated data cleaning
library(tidyverse) #tidyverse data cleaning universe
library(caret) #wrapper for creating, tuning and validating models
library(xgboost) #package for creating regression tree model
library(vtreat) # useful package for treating data before modeling 
library(Matrix) #creating matricies for xgboost
library(mgcv)
library(moments) #for measuring skewness
library(data.table) #alternative to dplyr we use to create lags
library(pdp) #partial dependence graphs
library(vip) #variable importance 
library(grid) #put multiple plots on one grid
library(gridExtra) #additional grid functionality
library(janitor) #one function used to clean transposed data set
library(ggpubr) #for qq plot 
library(owmr) #Removing Prefixes
library(kableExtra) # formatting HTML Tables
library(formattable) # formatting HTML Tables

The # comments generally explain what additional functionality each library adds to R

2.1.2 Load in Data

All data is downloaded from Fan Graphs from this location. The data is also available on my Github here. There are player level and team data sets


#data read-in
pitcher_data <- read_csv("FanGraphs Leaderboard_Pitching20IP.csv")
#Team datasets
FDG_Team = read_csv("FanGraphs Leaderboard_Team.csv")
#Create a prefix for all team stats that starts with T_
FDG_Team2 <- FDG_Team %>% 
  rename_with( ~ paste0("T_", .x))

2.1.3 Checking Team Data

str give information about an object, while skim provides a customizable summary


#Output not shown for space
#str(FDG_Team2)

skim(FDG_Team2) %>%  
  tibble::as_tibble() #Remove this option for a normal HTML table

2.2 Understanding the Dataset

2.2.1 Exploring the dataset

skim let’s us see how the data was imported into R. Documentation can be found here


#Full Dataset dimensions

skimr::skim(pitcher_data) %>% 
  tibble::as_tibble() %>%  #Remove this option for a normal HTML table
  select(skim_type,skim_variable,complete_rate) %>% 
  filter(complete_rate >0.30) #250 Variables

#skim_type - character or numeric
#skim_variable - name of variable
#complete_rate - % of data that is not missing
#filter - only keep variables that have 30% of data populated

Additionally let’s look at how variables vary by year to see if there are any discrepancies there


#It looks like one year, there were fewer games played, and there is a clear drop off in home runs
pitcher_data_dist =
pitcher_data %>% 
 group_by(Season) %>% 
  summarize (Max_Games = max(G),
             Avg_W= mean(W)
             )

pitcher_data_dist

#Plot Win Data by Year
ggplot(pitcher_data_dist, aes(Season, Avg_W)) +
  geom_col()+
  ggtitle("Average Wins by Year")+
  theme(plot.title = element_text(hjust = 0.5,size = 22,color ="steel blue"))


2.3 Cleaning and Creating Initial Dataset for Model

What are some issues with the data?

  1. Many of Variables, such as K%, are being read in as characters

    • Only Team and Player Name should be characters
  2. There is spotty data coverage in some of the variables (~Variables have less than 30% Coverage)

  3. 2020 Data only includes 60 games worth of data

    • This was a season shortened due to Covid-19
  4. Team Data needs to be appended to pitcher Data by Team Name


2.3.1 Cleanly Changing all Variables that are characters to numeric.

There are several ways to do this, we will identify the variables we want to change that are mis-identified. parse_number can be used to pull numbers from these variables. Additional ways to tackle this can be found here.


#Select Column names that are characters but not Team or Name, These should be percentages
pitcher_data_chars_to_convert <- pitcher_data %>% 
  select_if(is.character)%>% select(-Team,-Name) %>% 
  mutate_all (function(x) as.numeric(readr::parse_number(x))/100)
#Note : There are additional ways to do this, this is just one solution


#We can exclude the variables we converted and reintroduce them
pitcher_data_num <- pitcher_data %>% select(-colnames(pitcher_data_chars_to_convert))

pitcher_data2 = cbind(pitcher_data_num,pitcher_data_chars_to_convert) %>% 
  select (colnames(pitcher_data)) %>%  #preserve original order 
  dplyr::rename(flyball_perc = `FB%...50`,fastball_perc = `FB%...74`) #rename two ambiguous columns
  
skim(pitcher_data2) %>% 
  as_tibble() %>% 
  group_by(skim_type) %>% 
  count()


#Logical variables are R's best guess, in our case they are all NA's and will be removed at a later step

The same can be done for the Team Data that is loaded


#Select Column names that are characters but not Team or Name, These should be percentages
FDG_Team2_chars_to_convert <- FDG_Team2 %>% 
  select_if(is.character)%>% select(-T_Team) %>% 
  mutate_all (function(x) as.numeric(readr::parse_number(x))/100)
#Keep in mind, parse number may make actual characters into numerical variables so carefully check your data before using

#We can exclude the variables we converted and reintroduce them
FDG_Team2_num <- FDG_Team2 %>% select(-colnames(FDG_Team2_chars_to_convert))

FDG_Team3 = cbind(FDG_Team2_num,FDG_Team2_chars_to_convert) %>% 
  select (colnames(FDG_Team2)) %>%  #preserve original order
dplyr::rename(T_flyball_perc = `T_FB%...45`,T_fastball_perc = `T_FB%...72`)  #rename two ambiguous columns

skim(FDG_Team3) %>% 
  as_tibble() %>% 
  group_by(skim_type) %>% 
  count()

2.3.2 Filtering Data with Low Coverage

I choose 30% coverage of data necessary but this can be adjusted up or down. This will also get rid of columns that are all NA.


# Keep variables with enough values (Need 30% data coverage rate here)
Player_cols_to_keep =
skim(pitcher_data2) %>% 
  dplyr::select(skim_type, skim_variable, complete_rate) %>% 
  filter (complete_rate > 0.30)

#Transpose Rows to get column names as skim melts the data
Player_cols_to_keep_transpose = t(Player_cols_to_keep) 

#extract the colnames we would like to keep
Player_cols_to_keep = colnames(janitor::row_to_names(Player_cols_to_keep_transpose,row_number = 2))

#Only keep the columns designated to have over 30% of their data populated or greater
pitcher_data3 = pitcher_data2 %>% 
  select(one_of(Player_cols_to_keep)) 

Repeat the process for Team Variables

Team_cols_to_keep =
skim(FDG_Team3) %>% 
  dplyr::select(skim_type, skim_variable, complete_rate) %>% 
  filter (complete_rate > 0.30)


#Transpose Rows to get column names as skim melts the data
Team_cols_to_keep_transpose = t(Team_cols_to_keep) 

#extract the colnames we would like to keep
Team_cols_to_keep = colnames(janitor::row_to_names(Team_cols_to_keep_transpose,row_number = 2))

#Only keep the columns designated to have over 30% of their data populated or greater
FDG_Team4 = FDG_Team3 %>% 
  select(one_of(Team_cols_to_keep)) 

2.3.3 Creating Variables Normalized by Year

Some Variables will need to be normalized by Innings_Pitched (IP) if they aren’t a percentage already. Remaining Variables are percentages or indices so will not need to be transformed. The full data dictionary for these variables can be found on FanGraph’s website here. for pitching variables and here. for hitting variables.



pitcher_data4 = pitcher_data3 %>% 
  mutate( #create new variables based on existing variables
    W_IP = W/IP,
    L_IP =  L/IP, 
    ShO_IP = ShO/IP,
    SV_IP = SV/IP,
    BS_IP = BS/IP,
    TBF_IP = TBF/IP,
    H_IP = H/IP,
    R_IP = R/IP,
    ER_IP = ER/IP,
    HR_IP=HR/IP,
    BB_IP=BB/IP,
    IBB_IP=IBB/IP,
    HBP_IP=HBP/IP,
    WP_IP= WP/IP,
    BK_IP=BK/IP,
    SO_IP=SO/IP,
    GB_IP = GB/IP,   #Groundballs
    FB_IP =  FB/IP,  #FlyBalls
    LD_IP = LD/IP,   #LineDrives
    IFFB_IP = IFFB/IP,  #Infield Fly balls
    Balls_IP= Balls/IP,
    Strikes_IP= Strikes/IP,
    Pitches_IP= Pitches/IP,
    RS_IP= RS/IP,
    IFH_IP= IFH/IP,
    BU_IP= BU/IP,
    BUH_IP= BUH/IP,
    Pulls_IP= Pulls/IP,
    HLD_IP= HLD/IP,   
    SD_IP= SD/IP,    
    MD_IP= MD/IP,    
    Barrels_IP= Barrels/IP,
    HardHits_IP= HardHit/IP
  ) %>% select(-L,-G,-IP,-ShO,-BS,-(TBF:BK),-(GB:BUH),-Pulls,-(SD:MD),-Barrels,-HardHit,-Events)
               
#will be removed after data is lagged -FIP,-(RAR:WPA),,-(wFB:wCH),-(`ERA-`:`xFIP-`),-SIERA,-(`RA9-WAR`:`Age Rng`),-kwERA,-`wCH (pi)`:`wSL (pi)`,`K/9+`:`HR/FB%+`) 

#skim(pitcher_data4) %>% as_tibble()

Repeat the process for Team Variables


FDG_Team5 = FDG_Team4 %>% 
  mutate( #create new variables based on existing variables
    T_H_T_PA = T_H/T_PA,
    T_x1B_T_PA = T_1B/T_PA, #note: R can't have variables start with a number
    T_x2b_T_PA = T_2B/T_PA,
    T_x3b_T_PA = T_3B/T_PA,
    T_HR_T_PA = T_HR/T_PA,
    T_R_T_PA = T_R/T_PA,
    T_RBI_T_PA = T_RBI/T_PA,
    T_BB_T_PA = T_BB/T_PA,
    T_IBB_T_PA = T_IBB/T_PA,
    T_SO_T_PA=T_SO/T_PA,
    T_HBP_T_PA=T_HBP/T_PA,
    T_SF_T_PA=T_SF/T_PA,
    T_SH_T_PA=T_SH/T_PA,
    T_GDP_T_PA= T_GDP/T_PA,#ground into double play
    T_SB_T_PA=T_SB/T_PA,
    T_CS_T_PA=T_CS/T_PA,
    T_GB_T_PA = T_GB/T_PA,   #Groundballs
    T_FB_T_PA =  T_FB/T_PA,  #FlyBalls
    T_LD_T_PA = T_LD/T_PA,   #LineDrives
    T_IFFB_T_PA = T_IFFB/T_PA,  #Infield Fly balls
    T_Pitches_T_PA= T_Pitches/T_PA,
    T_Balls_T_PA= T_Balls/T_PA,
    T_Strikes_T_PA= T_Strikes/T_PA,
    T_IFH_T_PA= T_IFH/T_PA,
    T_BU_T_PA= T_BU/T_PA,
    T_BUH_T_PA= T_BUH/T_PA,
    T_PH_T_PA= T_PH/T_PA,
    T_Barrels_T_PA= T_Barrels/T_PA,
    T_HardHits_T_PA= T_HardHit/T_PA
  ) %>% select(-(T_H:T_CS),-(T_GB:T_BUH),-T_PH,-T_Barrels,-T_HardHit,-T_Events) #Drop the old variables


#skim(FDG_Team5) %>% as_tibble()

2.3.4 Creating Lagged Variables

There are several ways to lag a dataset BY GROUP.
* Dplyr way is here..
* The data.table (the method used below) is here.

#Note we will only be lagging the player level data, as the previous year's team performance shouldn't impact current performance


#Order the dataset by lag columns
pitcher_data5 =  arrange(pitcher_data4, playerid,Season) #playerid is the Fangraph id assigned to each player

# Convert dataframe to data.table format
DT_pitcher = data.table(pitcher_data5)

#designate columns to lag - which is all of them
cols1 = colnames(pitcher_data5)
anscols = paste("lag", cols1, sep="_")
DT_pitcher[, (anscols) := data.table::shift(.SD, 1, NA, "lag"),by ='playerid', .SDcols=cols1] #Create 1 period lags by year

pitcher_data6 = as.data.frame(DT_pitcher) %>% select(-lag_playerid, -lag_Team, -lag_Season, -lag_Age,-lag_Name)

ncol(pitcher_data5) #251 - no lags
[1] 251
ncol(pitcher_data6) #497 - lagged data ~ (251 * 2)-5
[1] 497

2.3.5 Merging Team and Player Data

We can use either the merge function or the SQL functionality provided by the sqldf package to join the lagged player level data to the Team level data


df_pitching_init = sqldf(
  "
  select a.*, b.*
  from pitcher_data6 a
  left join FDG_Team5 b
  on a.Team = b.T_Team and a.Season = b.T_Season
  
  "
)  %>% select(-T_Team,-T_Season,-T_Age,-T_G,-T_AB)# Unncessary Team Variables


nrow(df_pitching_init) - nrow(pitcher_data6) #check if any rows are duplicated
[1] 0

3 Creating Rankings for Players Based On Percentiles

We can use Percentile based ranking to get rankings for players from the 2021 season.

3.1 Worth of each stat

3.1.1 Calculating past performance

Each player goes from a 0% to 100% on each percentile stat that is used for creating a scoring opportunity. Data is not normalized by IP as certain stats such as Wins will be worth more when we do.


#Categories I include are:
#Wins, Saves, WHIP, ERA, SOs, Holds

df_pitching_init2 =  df_pitching_init %>%
#  arrange(player_id,year) %>% 
  group_by(Season) %>% 
  mutate(
    Wins_share = order(order(rank(W_IP,ties.method = 'average'),decreasing = FALSE))/n(),
     SO_share = order(order(rank(SO_IP,ties.method = 'average'),decreasing = FALSE))/n(),
     SV_share = order(order(rank(SV_IP,ties.method = 'average'),decreasing = FALSE))/n(),
     WHIP_share = order(order(rank(WHIP,ties.method = 'average'),decreasing = FALSE))/n(),
     ERA_share = order(order(rank(ERA,ties.method = 'average'),decreasing = FALSE))/n(),
    HLD_share = 0,
    Worth = Wins_share+SO_share+SV_share+WHIP_share+ERA_share+HLD_share
    ) %>% 
  ungroup() 

Chart of the Distribution of initial percentiles
As the chart below shows, the data is roughly normal.


skewness((df_pitching_init2$Worth))
[1] 0.086
ggplot2::qplot(df_pitching_init2$Worth, main="Total Pitching Worth Dataset") + geom_histogram(colour="black", fill="steelblue")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

min(df_pitching_init2$Worth)
[1] 0.32
max(df_pitching_init2$Worth)
[1] 4.6
ggpubr::ggqqplot(df_pitching_init2$Worth)


shapiro.test(df_pitching_init2$Worth)

    Shapiro-Wilk normality test

data:  df_pitching_init2$Worth
W = 1, p-value = 0.002

3.2 2021 Player Rankings - Per IP performance

3.2.1 2021 Player Rankings - Top Worth Players with Holds

Total Rankings for the players (Using 5x5 Scoring) can be found here. While it looks like many of the top players have low worth scores, it is because we haven’t applied a modifier for IP yet. Wins are harder to come by relative to any other stat and require more innings pitched.



df_pitching_init2_raw =  df_pitching_init %>%
#  arrange(player_id,year) %>% 
  group_by(Season) %>% 
  mutate(
    Wins_share_raw = order(order(rank(W,ties.method = 'average'),decreasing = FALSE))/n(),
     SO_share_raw = order(order(rank(SO,ties.method = 'average'),decreasing = FALSE))/n(),
     SV_share_raw = order(order(rank(SV,ties.method = 'average'),decreasing = FALSE))/n(),
     WHIP_share = order(order(rank(WHIP,ties.method = 'average'),decreasing = FALSE))/n(),
     ERA_share = order(order(rank(ERA,ties.method = 'average'),decreasing = FALSE))/n(),
    HLD_share_raw = 0,
    Worth = Wins_share_raw+SO_share_raw+SV_share_raw+WHIP_share+ERA_share+HLD_share_raw
    ) %>% 
  ungroup() %>% 
select(-W,-SO,-SV,-WHIP,-ERA,-HLD)



options(digits=2)

df_pitching_init2021_raw =
df_pitching_init2_raw %>% 
  group_by(Name) %>% 
  filter(Season == 2021) %>% 
  arrange(desc(Worth)) %>% 
  select(Name,Wins_share_raw,SO_share_raw,SV_share_raw,WHIP_share,ERA_share,Worth)


df_pitching_init2021_raw %>%
  filter (Worth>3.5) %>% 
  kbl() %>% 
 kable_material(c("striped", "hover","condensed","responsive"),full_width = F,fixed_thead = T)
Name Wins_share_raw SO_share_raw SV_share_raw WHIP_share ERA_share Worth
Daniel Bard 0.79 0.69 0.97 0.87 0.75 4.0
Garrett Richards 0.79 0.84 0.86 0.87 0.68 4.0
Jesus Luzardo 0.77 0.78 0.55 0.89 0.92 3.9
Brady Singer 0.72 0.89 0.66 0.82 0.69 3.8
Brad Keller 0.86 0.85 0.31 0.92 0.77 3.7
Jose Alvarado 0.82 0.60 0.90 0.87 0.51 3.7
Mitch Keller 0.69 0.76 0.40 0.96 0.87 3.7
Justus Sheffield 0.81 0.56 0.37 0.98 0.95 3.7
Nick Pivetta 0.89 0.95 0.72 0.51 0.60 3.7
Rafael Montero 0.65 0.34 0.91 0.81 0.91 3.6
Josh Fleming 0.93 0.58 0.78 0.59 0.73 3.6
Alec Mills 0.74 0.73 0.71 0.69 0.73 3.6
Joe Jimenez 0.75 0.49 0.73 0.80 0.85 3.6
Paul Fry 0.58 0.52 0.83 0.80 0.87 3.6
Wil Crowe 0.61 0.82 0.53 0.84 0.78 3.6
Erick Fedde 0.82 0.87 0.38 0.70 0.78 3.5
Adam Ottavino 0.77 0.62 0.93 0.70 0.51 3.5
Bryan Garcia 0.50 0.22 0.85 0.98 0.98 3.5
Alex Reyes 0.92 0.77 0.98 0.58 0.25 3.5

3.3 2021 Player Rankings - Actual Performance

3.3.1 2021 Player Rankings - Top Worth Players with Holds

While it looks like many of the top players have low worth scores, it is because we haven’t applied a modifier for IP yet.


options(digits=2)

df_pitching_init2021 =
df_pitching_init2 %>% 
  group_by(Name) %>% 
  filter(Season == 2021) %>% 
  arrange(desc(Worth)) %>% 
  select(Name,Wins_share,SO_share,SV_share,WHIP_share,ERA_share,HLD_share,Worth)


df_pitching_init2021 %>%
  filter (Worth>2.9) %>% 
  kbl() %>% 
 kable_material(c("striped", "hover","condensed","responsive"),full_width = F,fixed_thead = T)
Name Wins_share SO_share SV_share WHIP_share ERA_share HLD_share Worth
Daniel Bard 0.93 0.85 0.97 0.87 0.75 0 4.4
Joe Jimenez 0.98 0.86 0.76 0.80 0.85 0 4.2
Paul Fry 0.83 0.87 0.82 0.80 0.87 0 4.2
Rafael Dolis 0.63 0.84 0.89 0.95 0.81 0 4.1
Ben Bowden 0.83 0.82 0.57 0.97 0.92 0 4.1
Jose Alvarado 0.96 0.85 0.89 0.87 0.51 0 4.1
Tanner Rainey 0.22 0.93 0.89 0.93 0.97 0 3.9
Sean Newcomb 0.62 0.93 0.79 0.93 0.65 0 3.9
Ryan Hendrix 1.00 0.73 0.51 0.82 0.85 0 3.9
Adam Ottavino 0.95 0.76 0.94 0.70 0.51 0 3.9
Rafael Montero 0.91 0.26 0.91 0.81 0.91 0 3.8
Enyel De Los Santos 0.55 0.94 0.44 0.95 0.90 0 3.8
Bryan Garcia 0.76 0.21 0.85 0.98 0.98 0 3.8
Tanner Scott 0.87 0.90 0.40 0.84 0.74 0 3.8
Aroldis Chapman 0.93 1.00 0.99 0.52 0.27 0 3.7
Sean Poppen 0.39 0.79 0.83 0.96 0.74 0 3.7
Alex Reyes 0.98 0.90 0.98 0.58 0.25 0 3.7
James Karinchak 0.97 0.96 0.94 0.35 0.47 0 3.7
Bryan Abreu 0.83 0.52 0.78 0.73 0.82 0 3.7
Sean Reid-Foley 0.90 0.89 0.36 0.77 0.75 0 3.7
Jeurys Familia 0.99 0.84 0.72 0.66 0.45 0 3.7
Rex Brothers 0.55 0.96 0.73 0.68 0.75 0 3.7
Paul Campbell 0.76 0.49 0.63 0.86 0.91 0 3.6
Trevor Megill 0.35 0.90 0.40 0.98 0.99 0 3.6
Lucas Sims 0.92 0.99 0.92 0.20 0.56 0 3.6
Pete Fairbanks 0.71 0.92 0.90 0.69 0.35 0 3.6
Jesus Luzardo 0.63 0.57 0.55 0.89 0.92 0 3.6
Gregory Soto 0.89 0.83 0.96 0.59 0.28 0 3.6
David Hess 0.91 0.32 0.32 1.00 1.00 0 3.5
Amir Garrett 0.04 0.89 0.92 0.83 0.86 0 3.5
Phil Maton 0.87 0.88 0.42 0.72 0.65 0 3.5
Jacob Barnes 0.26 0.78 0.87 0.73 0.88 0 3.5
Matt Barnes 0.94 0.98 0.98 0.21 0.41 0 3.5
Ryan Helsley 0.97 0.52 0.75 0.67 0.61 0 3.5
Cesar Valdez 0.36 0.47 0.93 0.91 0.84 0 3.5
Trevor May 0.94 0.92 0.86 0.43 0.35 0 3.5
Jacob Webb 0.98 0.45 0.79 0.79 0.50 0 3.5
Brad Brach 0.24 0.69 0.81 0.87 0.89 0 3.5
Adam Morgan 0.80 0.72 0.88 0.56 0.53 0 3.5
Luis Oviedo 0.24 0.62 0.63 0.99 0.99 0 3.5
Glenn Otto 0.10 0.83 0.61 0.93 0.99 0 3.5
Sam Howard 0.66 0.92 0.33 0.73 0.80 0 3.4
Reid Detmers 0.44 0.39 0.67 0.97 0.97 0 3.4
Michael Feliz 0.02 0.70 0.85 0.91 0.96 0 3.4
Aaron Ashby 0.90 0.86 0.80 0.28 0.61 0 3.4
Griffin Canning 0.80 0.51 0.54 0.75 0.80 0 3.4
Brett de Geus 0.59 0.21 0.65 0.95 0.98 0 3.4
J.B. Wendelken 0.87 0.33 0.83 0.78 0.54 0 3.4
Matt Foster 0.49 0.55 0.77 0.70 0.85 0 3.4
Aaron Bummer 0.86 0.92 0.81 0.44 0.33 0 3.4
Tyler Chatwood 0.22 0.80 0.80 0.72 0.81 0 3.4
Kyle Finnegan 0.75 0.56 0.93 0.74 0.34 0 3.3
Sean Doolittle 0.60 0.66 0.74 0.72 0.60 0 3.3
Justus Sheffield 0.85 0.17 0.37 0.98 0.95 0 3.3
Anthony Kay 0.20 0.79 0.59 0.92 0.81 0 3.3
Chris Stratton 0.86 0.67 0.90 0.50 0.37 0 3.3
Josh Sborz 0.67 0.78 0.72 0.67 0.45 0 3.3
Matt Andriese 0.33 0.59 0.74 0.88 0.75 0 3.3
Jake Diekman 0.45 0.95 0.90 0.56 0.43 0 3.3
Justin Garza 0.71 0.56 0.53 0.84 0.65 0 3.3
Cionel Perez 0.34 0.59 0.51 0.93 0.90 0 3.3
Shane Greene 0.01 0.57 0.83 0.91 0.96 0 3.3
Edwin Uceta 0.10 0.86 0.59 0.80 0.93 0 3.3
Michael Rucker 0.08 0.63 0.81 0.80 0.95 0 3.3
J.D. Hammer 0.47 0.70 0.51 0.88 0.70 0 3.3
Jeffrey Springs 0.95 0.96 0.83 0.20 0.30 0 3.2
Jake Brentz 0.78 0.80 0.80 0.47 0.38 0 3.2
Hirokazu Sawamura 0.89 0.77 0.68 0.71 0.19 0 3.2
Spencer Howard 0.11 0.62 0.65 0.88 0.97 0 3.2
Heath Hembree 0.25 0.96 0.92 0.30 0.79 0 3.2
Devin Williams 0.99 0.99 0.86 0.31 0.09 0 3.2
Seth Lugo 0.84 0.82 0.75 0.49 0.32 0 3.2
Zac Lowther 0.24 0.56 0.57 0.91 0.93 0 3.2
Hansel Robles 0.36 0.70 0.95 0.62 0.58 0 3.2
Paul Sewald 0.99 0.99 0.93 0.10 0.19 0 3.2
Mychal Givens 0.79 0.62 0.92 0.61 0.27 0 3.2
Edwin Diaz 0.80 0.96 0.99 0.13 0.31 0 3.2
Max Kranick 0.50 0.23 0.63 0.94 0.89 0 3.2
Mike Mayers 0.66 0.82 0.77 0.49 0.43 0 3.2
Trevor Stephan 0.42 0.81 0.71 0.66 0.57 0 3.2
Carlos Estevez 0.44 0.48 0.94 0.76 0.56 0 3.2
Anthony Castro 0.32 0.91 0.82 0.44 0.66 0 3.2
Jackson Kowar 0.10 0.44 0.62 1.00 1.00 0 3.2
Brad Hand 0.88 0.41 0.97 0.45 0.44 0 3.1
Daniel Norris 0.26 0.54 0.72 0.76 0.87 0 3.1
Camilo Doval 1.00 0.94 0.90 0.13 0.17 0 3.1
Kyle Zimmer 0.74 0.26 0.81 0.65 0.68 0 3.1
Nick Mears 0.36 0.50 0.66 0.90 0.71 0 3.1
Conner Greene 0.30 0.58 0.35 0.95 0.95 0 3.1
Greg Holland 0.52 0.43 0.91 0.57 0.68 0 3.1
Daniel Lynch 0.58 0.20 0.62 0.90 0.81 0 3.1
Demarcus Evans 0.10 0.87 0.60 0.81 0.74 0 3.1
Edward Cabrera 0.10 0.64 0.62 0.90 0.83 0 3.1
Eduardo Rodriguez 0.82 0.79 0.19 0.64 0.66 0 3.1
Kyle Keller 0.21 0.67 0.47 0.83 0.92 0 3.1
Nick Pivetta 0.57 0.74 0.68 0.51 0.60 0 3.1
Joely Rodriguez 0.36 0.54 0.75 0.81 0.63 0 3.1
Joakim Soria 0.17 0.66 0.92 0.61 0.72 0 3.1
Austin Voth 0.70 0.57 0.29 0.74 0.76 0 3.1
Aaron Slegers 0.64 0.20 0.30 0.98 0.95 0 3.1
Andrew Heaney 0.62 0.77 0.30 0.53 0.84 0 3.1
Victor Gonzalez 0.84 0.39 0.78 0.70 0.35 0 3.0
Kevin Ginkel 0.09 0.70 0.54 0.82 0.89 0 3.0
Dylan Cease 0.79 0.94 0.45 0.42 0.44 0 3.0
Brooks Raley 0.32 0.92 0.82 0.32 0.67 0 3.0
Mitch Keller 0.46 0.35 0.40 0.96 0.87 0 3.0
Stefan Crichton 0.05 0.11 0.93 0.99 0.96 0 3.0
Garrett Richards 0.49 0.24 0.75 0.87 0.68 0 3.0
Anthony Misiewicz 0.87 0.47 0.43 0.64 0.63 0 3.0
Archie Bradley 0.98 0.17 0.82 0.68 0.39 0 3.0
Alex Claudio 0.22 0.38 0.79 0.86 0.79 0 3.0
Yimi Garcia 0.69 0.60 0.96 0.25 0.51 0 3.0
Liam Hendriks 0.95 0.99 0.99 0.01 0.09 0 3.0
Jordan Romano 0.94 0.93 0.97 0.13 0.04 0 3.0
Cody Poteet 0.66 0.62 0.47 0.57 0.71 0 3.0
Alex Colome 0.61 0.31 0.96 0.64 0.49 0 3.0
Brady Singer 0.29 0.54 0.66 0.82 0.69 0 3.0
Raisel Iglesias 0.91 0.98 0.99 0.04 0.10 0 3.0
Sam Coonrod 0.42 0.75 0.84 0.53 0.47 0 3.0
Bruce Zimmermann 0.62 0.29 0.58 0.79 0.72 0 3.0
Darwinzon Hernandez 0.47 0.93 0.53 0.77 0.28 0 3.0
Junior Guerra 0.76 0.38 0.05 0.94 0.86 0 3.0
Codi Heuer 0.92 0.22 0.79 0.53 0.54 0 3.0
Andrew Wantz 0.28 0.95 0.66 0.40 0.70 0 3.0
Hector Neris 0.51 0.91 0.93 0.26 0.37 0 3.0
Sam Hentges 0.12 0.51 0.45 0.96 0.93 0 3.0
Miguel Sanchez 0.77 0.31 0.56 0.85 0.49 0 3.0
Rowan Wick 0.04 0.86 0.95 0.58 0.54 0 3.0
Humberto Mejia 0.10 0.33 0.60 0.98 0.96 0 3.0
Jose Cisnero 0.64 0.53 0.87 0.54 0.38 0 3.0
Richard Lovelady 0.91 0.74 0.85 0.15 0.32 0 3.0
Brad Boxberger 0.78 0.90 0.86 0.15 0.26 0 3.0
Shane McClanahan 0.81 0.76 0.62 0.46 0.31 0 3.0
Phillips Valdez 0.47 0.29 0.76 0.58 0.84 0 2.9
Josh Fleming 0.89 0.04 0.69 0.59 0.73 0 2.9
Brusdar Graterol 0.87 0.20 0.58 0.66 0.62 0 2.9
Jose Quintana 0.02 0.93 0.13 0.94 0.91 0 2.9
Kyle Funkhouser 0.92 0.37 0.71 0.65 0.30 0 2.9
Daniel Ponce de Leon 0.21 0.10 0.86 0.89 0.87 0 2.9
JC Mejia 0.13 0.33 0.59 0.88 0.99 0 2.9
Wandy Peralta 0.90 0.24 0.89 0.62 0.27 0 2.9
Michael Fulmer 0.72 0.61 0.94 0.47 0.17 0 2.9
Alex Lange 0.19 0.71 0.78 0.77 0.47 0 2.9
Brad Keller 0.59 0.32 0.31 0.92 0.77 0 2.9
Shawn Armstrong 0.18 0.84 0.17 0.77 0.94 0 2.9
Braxton Garrett 0.20 0.40 0.63 0.97 0.71 0 2.9
Diego Castillo 0.84 0.89 0.96 0.08 0.13 0 2.9
Chris Rodriguez 0.68 0.50 0.64 0.71 0.38 0 2.9
Josiah Gray 0.19 0.67 0.66 0.60 0.78 0 2.9
Brandon Bielak 0.59 0.36 0.74 0.63 0.59 0 2.9

4 Creating Model File

4.1 Additional Data Prep

4.1.1 Remove Variables which are based off current hitting numbers

Not all variables can be used for predictive modeling. Variables that go into the percentile ranking or are non-normalized metrics created after the fact (such as WAR - Wins above Replacement or RS - Raw Run Support) should be removed. However, metrics that are normalized by a per pitch basis (such as wFB/C) can remain as we expect pitchers to have similar performance in these metrics one year out.

#Be careful about RS - Run Support and RS/9

#Creating a new dataset to keep original intact
df_pitching_init3 = df_pitching_init2 %>% 
  select (-Name)

Lagged Percentile (_share) Variables can be used for predictive modeling. However since these variables were created for the Worth metric they must also be removed for modeling purposes.


#Order the dataset by lag columns
df_pitching_init4 =  arrange(df_pitching_init3, playerid,Season) #playerid is the Fangraph id assigned to each player

# Convert dataframe to data.table format
DT_pitcher2 = data.table(df_pitching_init4)

#designate columns to lag - just the new shares
cols1 = (c('Wins_share','SO_share','SV_share', 'ERA_share','WHIP_share','HLD_share','Worth'))
anscols = paste("lag", cols1, sep="_") 
DT_pitcher2[, (anscols) := data.table::shift(.SD, 1, NA, "lag"),by ='playerid', .SDcols=cols1] #Create 1 period lags by year

df_pitching_final = as.data.frame(DT_pitcher2) %>% 
  select(-c(Wins_share,SO_share,SV_share, ERA_share,WHIP_share,HLD_share))%>%
select(-FIP,-(RAR:WPA),-(wFB:wCH),-(`ERA-`:`xFIP-`),
       -SIERA,-(`RA9-WAR`:`Age Rng`),-kwERA,-(`wCH (pi)`:`wSL (pi)`),-(`K/9+`:`HR/FB%+`)) %>% select(-W,-SO,-SV,-HLD,-W_IP,-SO_IP,-SV_IP,-WHIP,-ERA,-HLD_IP)

4.1.2 Creating Training/Test Split

We split the data into Training Data (which is used to create the model) and test data (which is used to validate the model)


set.seed(15674)  # For reproducibility
# Create index for testing and training data
inTrain <- createDataPartition(y = df_pitching_final$Worth, p = 0.80, list = FALSE)
# subset pitching data for training
tr_2021 <- df_pitching_final[inTrain,]
# subset the rest to test and validate trained model
te_2021 <- df_pitching_final[-inTrain,]

nrow(tr_2021)/nrow(df_pitching_final) #check if split is 0.8
[1] 0.8

4.1.3 Treat Missing Data by Imputing Mean Value

Vtreat Package in R is excellent for treating data before using for modeling. Additional documentation can be found here.

treat_plan_2021 <- vtreat::designTreatmentsZ(
  dframe = tr_2021, # training data
  varlist = colnames(tr_2021) %>% .[. != "hitting_score1"], # input variables = all training data columns, except random
  codeRestriction = c("clean", "isBAD", "lev"), # derived variables types (drop cat_P)
  verbose = FALSE) # suppress messages

#clean stands for cleaned numerical variable, isBAD indicates that a value replacement has occurred (which indicates a missing value in this case), and lev is a binary indicator whether a particular value of that categorical variable was present.  

#### Checking Scoreframe

score_frame <- treat_plan_2021$scoreFrame %>% 
  select(varName, origName, code)

head(score_frame)


tr_treated_2021 <- vtreat::prepare(treat_plan_2021, tr_2021)
te_treated_2021 <- vtreat::prepare(treat_plan_2021, te_2021)


treat_plan_2021 <- vtreat::designTreatmentsZ(
  dframe = DT_pitcher2, # training data
  varlist = colnames(DT_pitcher2) %>% .[. != "hitting_score1"], # input variables = all training data columns, except random
  codeRestriction = c("clean", "isBAD", "lev"), # derived variables types (drop cat_P)
  verbose = FALSE) # suppress messages


total_treated_2021_pitching <- vtreat::prepare(treat_plan_2021, DT_pitcher2)

#tr_treated = tr
#te_treated = te

dim(tr_treated_2021) #note there are dummies for each player and team
[1] 3196 1415

4.1.4 Check Distribution of Training Population

The population used for Training should be indicative of Total Population


ggplot2::qplot(tr_treated_2021$Worth, main="Training Set") + geom_histogram(colour="black", fill="steelblue") + theme_bw()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#The skewness is actually a bit better than the overall data set
skewness(tr_treated_2021$Worth) 
[1] 0.077

5 Running XGboost Model

To keep things simple with modeling, we’ll turn the training data into simple input variables for caret::train, dropping the response variable and converting the data frame to a matrix. Documentation for this approach to XGboost can be found here.

5.1 Tuning the Model

5.1.1 Initial Non-Tuned Model

Break the data set into x and y inputs with x being a matrix. "_isBAD" is a category created by the Vtreat package in case you want to identify rows

input_x <- as.matrix(((tr_treated_2021))%>%
   select(-Worth) %>%                      
   select(!ends_with ("_isBAD")))

input_y <- tr_treated_2021$Worth

XGBoost with Default Hyperparameters:
The Variable Importance (caret::varImp(xgb_base_2021, scale = F) from the caret package shows the contribution of each variable to the initial model. Since this is untuned, we can expect the percentage imporantance to change as the models iterate through potential hyperparameters.
XGBoost documentation can be found for more general models here.


#Defaults for xgboost model
grid_default <- expand.grid(
  nrounds = 100,
  max_depth = 6,
  eta = 0.3,
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)

#This is a blank train_control set, this will be updated after
train_control <- caret::trainControl(
  method = "none",
  verboseIter = FALSE, # no training log
  allowParallel = TRUE # FALSE for reproducible results 
)

xgb_base_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = train_control,
  tuneGrid = grid_default,
  method = "xgbTree",
  verbose = TRUE
)

caret::varImp(xgb_base_2021, scale = F  )
xgbTree variable importance

  only 20 most important variables shown (out of 773)

5.2 Further Variable Selection

5.2.1 Remove redundant and highly correlated variables

Selection Removal Step 1: Check for high correlations
Normally, this step is done early, but those steps were reserved for preparing the data


dep_cor1 <- t(as.data.frame(cor(tr_treated_2021[ , colnames(tr_treated_2021) != "Worth"],
                tr_treated_2021$Worth)))
dep_cor1 <-
as.data.frame(t(as.data.frame(dep_cor1)%>% 
  select(!starts_with("lag")) %>% #remove lag variables
  select(!contains("_isBAD")))) 

dep_cor1 <- tibble::rownames_to_column(dep_cor1,"VARIABLES")%>% #remove indicators for missing data
  filter(V1 > 0.40|V1 < -0.3)

dep_cor1

dep_cor2 <- colnames(row_to_names(t(dep_cor1),row_number = 1))

Let’s Remove variables with high correlation to worth metric, and metrics that are calculated after a player’s performance (such as WPA/RE24) or redundant (RS_IP)


input_x <- as.matrix(((tr_treated_2021))%>%
   select(-Worth) %>% #Remove some variables variables
     select (-RS_IP,-ER_IP,-R_IP,-REW,-RE24,-Clutch,-WPA_slash_LI,-Season #Remove redundant variables or non/weighted variables
) %>%      
select(!ends_with ("_isBAD"))) #indicator variable for missing data

input_y <- tr_treated_2021$Worth

Run the model on the new dataset to make sure the variable importances look fine


#Note Training parameters were set in initial model set up
xgb_base_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = train_control,
  tuneGrid = grid_default,
  method = "xgbTree",
  verbose = TRUE
)

caret::varImp(xgb_base_2021, scale = F  )
xgbTree variable importance

  only 20 most important variables shown (out of 765)

5.3 Model with new data

5.3.1 Tuning All Hyperparameters

A tune grid allows us to test a large amount of hyper-parameters and find the model with the lowest RMSE for predictions.
However, The more values you want to test and the greater the amount of Cross-Fold Validations (method = "cv"), the greater the computational time it will take. More information on the specific parameters can be found here.


# maximum number of trees
nrounds <- 1000

# note to start nrounds from 200, as smaller learning rates result in errors so
# big with lower starting points that they'll mess the scales
tune_grid <- expand.grid(
  nrounds = seq(from = 100, to = nrounds, by = 50),
  eta = c(0.01, 0.025, 0.05, 0.075, 0.1),
  max_depth = c(2, 4, 6, 8, 10),
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)

tune_control <- caret::trainControl(
  method = "cv", # cross-validation
  number = 5, # with n folds 
  ## Note this was # out in the original code
  #index = createFolds(tr_treated$Id_clean), # fix the folds
  verboseIter = FALSE, # no training log
  allowParallel = FALSE # FALSE for reproducible results 
)

Running the initial tuning model

#Note I will be timing these runs to give an estimate on how long this model takes to run
start_time <- Sys.time()

xgb_tune_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = tune_control,
  tuneGrid = tune_grid,
  method = "xgbTree",
  verbose = FALSE
  ,verbosity = 0
)

end_time <- Sys.time()

end_time - start_time
Time difference of 21 mins

Tuning Plot and Variable Importance

varImp(xgb_tune_2021, scale = F  ) 
xgbTree variable importance

  only 20 most important variables shown (out of 765)
# helper function for the plots
tuneplot <- function(x, probs = .90) {
  ggplot(x) +
    coord_cartesian(ylim = c(quantile(x$results$RMSE, probs = probs), min(x$results$RMSE))) +
    theme_bw()
}

tuneplot(xgb_tune_2021)


5.3.2 Fine Tuning Model

5.3.2.1 Second Tuning: Maximum Depth and Minimum Child Weight

After fixing the learning rate to the best tune from the previous iteration and we’ll also set maximum depth to 3 +-1 (or +2 if max_depth == 2) to experiment a bit around the suggested best tune in previous step. Then, well fix maximum depth and minimum child weight.

tune_grid2 <- expand.grid(
  nrounds = seq(from = 50, to = nrounds, by = 50),
  eta = xgb_tune_2021$bestTune$eta,
  max_depth = ifelse(xgb_tune_2021$bestTune$max_depth == 2,
    c(xgb_tune_2021$bestTune$max_depth:4),
    xgb_tune_2021$bestTune$max_depth - 1:xgb_tune_2021$bestTune$max_depth + 1),
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = c(1, 2, 3),
  subsample = 1
)

xgb_tune2_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = tune_control,
  tuneGrid = tune_grid2,
  method = "xgbTree",
  verbose = TRUE
)

tuneplot(xgb_tune2_2021)


xgb_tune2_2021$bestTune

varImp(xgb_tune2_2021, scale = F  ) 
xgbTree variable importance

  only 20 most important variables shown (out of 765)

5.3.2.2 Third Tuning: Column and Row Sampling


tune_grid3 <- expand.grid(
  nrounds = seq(from = 50, to = nrounds, by = 50),
  eta = xgb_tune_2021$bestTune$eta,
  max_depth = xgb_tune2_2021$bestTune$max_depth,
  gamma = 0,
  colsample_bytree = c(0.4, 0.6, 0.8, 1.0),
  min_child_weight = xgb_tune2_2021$bestTune$min_child_weight,
  subsample = c(0.5, 0.75, 1.0)
)

xgb_tune3_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = tune_control,
  tuneGrid = tune_grid3,
  method = "xgbTree",
  verbose = TRUE
)

tuneplot(xgb_tune3_2021, probs = .95)


xgb_tune3_2021$bestTune

varImp(xgb_tune3_2021, scale = F  ) 
xgbTree variable importance

  only 20 most important variables shown (out of 765)

5.3.2.3 Fourth Tuning: Gamma

Next, we again pick the best values from previous step, and now will see whether changing the gamma has any effect on the model fit:

tune_grid4 <- expand.grid(
  nrounds = seq(from = 50, to = nrounds, by = 50),
  eta = xgb_tune_2021$bestTune$eta,
  max_depth = xgb_tune2_2021$bestTune$max_depth,
  gamma = c(0, 0.05,0.1, 0.2,0.4, 0.5, 0.7, 0.9, 1.0),
  colsample_bytree = xgb_tune3_2021$bestTune$colsample_bytree,
  min_child_weight = xgb_tune2_2021$bestTune$min_child_weight,
  subsample = xgb_tune3_2021$bestTune$subsample
)

xgb_tune4_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = tune_control,
  tuneGrid = tune_grid4,
  method = "xgbTree",
  verbose = TRUE
)

tuneplot(xgb_tune4_2021)
Warning: The shape palette can deal with a maximum of 6 discrete values because more than 6 becomes difficult
to discriminate; you have 9. Consider specifying shapes manually if you must have them.
Warning: Removed 60 rows containing missing values (geom_point).

xgb_tune4_2021$bestTune

varImp(xgb_tune4_2021, scale = F  ) 
xgbTree variable importance

  only 20 most important variables shown (out of 765)

5.3.2.4 Fifth Tuning: Reducing the Learning Rate

Now, we have tuned the hyperparameters and can start reducing the learning rate to get to the final model:

start_time <- Sys.time()

tune_grid5 <- expand.grid(
  nrounds = seq(from = 100, to = 10000, by = 75),
   eta = c(0.01, 0.015, 0.025,0.035, 0.05,0.75, 0.1),
  max_depth = xgb_tune2_2021$bestTune$max_depth,
  gamma = xgb_tune4_2021$bestTune$gamma,
  colsample_bytree = xgb_tune3_2021$bestTune$colsample_bytree,
  min_child_weight = xgb_tune2_2021$bestTune$min_child_weight,
  subsample = xgb_tune3_2021$bestTune$subsample
)



xgb_tune5_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = tune_control,
  tuneGrid = tune_grid5,
  method = "xgbTree",
  verbose = TRUE
)

#tuneplot(xgb_tune5_2021)

end_time <- Sys.time()

end_time - start_time
Time difference of 28 mins
xgb_tune5_2021$bestTune

varImp(xgb_tune5_2021, scale = F  ) 
xgbTree variable importance

  only 20 most important variables shown (out of 765)

5.3.2.5 Fitting Final Model


(final_grid_2021 <- expand.grid(
  nrounds = xgb_tune5_2021$bestTune$nrounds,
  eta = xgb_tune5_2021$bestTune$eta,
  max_depth = xgb_tune5_2021$bestTune$max_depth,
  gamma = xgb_tune5_2021$bestTune$gamma,
  colsample_bytree = xgb_tune5_2021$bestTune$colsample_bytree,
  min_child_weight = xgb_tune5_2021$bestTune$min_child_weight,
  subsample = xgb_tune5_2021$bestTune$subsample
))

(xgb_model_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = train_control,
  tuneGrid = final_grid_2021,
  method = "xgbTree",
  verbose = TRUE
))
eXtreme Gradient Boosting 

3196 samples
 765 predictor

No pre-processing
Resampling: None 
varImp(xgb_model_2021, scale = F  ) 
xgbTree variable importance

  only 20 most important variables shown (out of 765)

5.4 Model Performance

5.4.1 Checking Model on Test Split Data

We don’t need to look too closely at are training data as Xgboost will heavily overfit the model based on that data. The more important part is how the model performs on in predicting our Test Sample that was not included.



y_pred_test <- predict(xgb_model_2021, data.matrix(te_treated_2021))

test_stats= cbind((te_treated_2021$Worth),y_pred_test)

test_statsR2 = cor(test_stats[,1],test_stats[,2])^2

print(test_statsR2)
[1] 0.71
y_pred_train <- predict(xgb_model_2021, data.matrix(tr_treated_2021))

train_stats = cbind((tr_treated_2021$Worth),y_pred_train)

train_statsR2 = cor(train_stats[,1],train_stats[,2])^2

print(train_statsR2)
[1] 0.98
#test dataset
x <- select(te_treated_2021, -Worth)
y <- (te_treated_2021$Worth)

(xgb_model_rmse <- ModelMetrics::rmse(y, predict(xgb_model_2021, newdata = x)))
[1] 0.34
holdout_x <- select(tr_treated_2021, -Worth)
holdout_y <- tr_treated_2021$Worth

(xgb_model_rmse <- ModelMetrics::rmse(holdout_y, predict(xgb_model_2021, newdata = holdout_x)))
[1] 0.081

5.4.1.1 Graphical Representation of Model


ggplot2::ggplot() +
  aes(x = test_stats[,1], y = test_stats[,2]) +
  geom_jitter() +
  xlab("Predicted Values") +
  ylab("Actual Values") +
  ggtitle("Results of Pitching Model on Test Data")+
  theme(plot.title = element_text(hjust = 0.5,size = 22,color ="steel blue"))+
  geom_smooth(method = "lm")
`geom_smooth()` using formula 'y ~ x'


6 Creating 2022 Projections from Model

6.1 Re-fit model for Important Variables

Now that we have an acceptable model, we can use it to create projections for how well we think players should do in 2022 based on their hitting statistics in 2021. First let’s reduce

Step 1: Only keep variables with high enough importance in model



vip(xgb_model_2021, num_features = 30)  # 10 is the default, 30 gives a visual on the top 30 most important features of the model


unscalevi = vi(xgb_model_2021, method="model") #shows the numbers behind the plot

unscalevi$Importance_perc = with(unscalevi,Importance/sum(Importance)) #adds percentages 

unscalevi # importance by variables

variables_to_keep_2021 = subset(unscalevi, Importance_perc > 0.0010) %>% select(Variable) #Keep Variables that explain at least a small amount [0.1%] of the model. This is a low threshold for inclusion ,but you can adjust this

variables_to_keep_2021b = t(variables_to_keep_2021)

variables_to_keep_2022 = colnames(row_to_names(variables_to_keep_2021b,row_number = 1))

tr_treated_2022 = tr_treated_2021 %>%  select(Worth,one_of(variables_to_keep_2022),starts_with("Team_lev_x_")) #keep modeled important variables along with team indicator variables

te_treated_2022 = te_treated_2021 %>%  select(Worth,one_of(variables_to_keep_2022),starts_with("Team_lev_x_"))

input_x_2022 = as.matrix(select(tr_treated_2022, -Worth))

input_y_2022 = tr_treated_2022$Worth

Step 2: Re-fit model with reduced variable scope
Note from the best tune below the nrounds - is the max I set above and eta is at the lowest possible value. This could cause potential overfitting issues, but from our Actual vs. Predicted Graph we know this not to be the case.



(final_grid_2021 <- expand.grid(
  nrounds = xgb_tune5_2021$bestTune$nrounds,
  eta = xgb_tune5_2021$bestTune$eta,
  max_depth = xgb_tune5_2021$bestTune$max_depth,
  gamma = xgb_tune5_2021$bestTune$gamma,
  colsample_bytree = xgb_tune5_2021$bestTune$colsample_bytree,
  min_child_weight = xgb_tune5_2021$bestTune$min_child_weight,
  subsample = xgb_tune5_2021$bestTune$subsample
))

(xgb_model_2022 <- caret::train(
  x = input_x_2022,
  y = input_y_2022,
  trControl = train_control,
  tuneGrid = final_grid_2021,
  method = "xgbTree",
  verbose = TRUE
))
eXtreme Gradient Boosting 

3196 samples
 125 predictor

No pre-processing
Resampling: None 
vip(xgb_model_2022, num_features = 30)


unscalevi24 = vi(xgb_model_2022, method="model")

unscalevi24$Importance_perc = with(unscalevi24,Importance/sum(Importance)) 

unscalevi24

# Save work for later prediction

save(xgb_model_2022,file = '2022_Pitching5x5_Model.Rdata')

pitching5x5 = xgb_model_2022

pitchinginput = input_x_2022

6.2 Get 2022 list of players

6.2.1 Arrange the Data so the Columns are in the exact order as the model

First let’s prepare a file for predicting based on our model object



variableslag5x= row_to_names(as.data.frame(t(variables_to_keep_2022)),row_number = 1)  %>% select (starts_with("lag"))

variables_nolag5x = (owmr::remove_prefix(variableslag5x,"lag" , sep = "_"))

Data_Predict_2022a5x = total_treated_2021_pitching %>% select (one_of(colnames(variables_nolag5x)),Season,playerid)

colnames(Data_Predict_2022a5x) <- paste0("lag_", colnames(Data_Predict_2022a5x))

Data_Predict_2022b5x = total_treated_2021_pitching %>% select (one_of(colnames(variables_nolag5x)))
colnames(Data_Predict_2022b5x) = colnames(variableslag5x)

variables_to_keep_2022_nolag5x = total_treated_2021_pitching %>% select(one_of(variables_to_keep_2022),Season,playerid,starts_with("Team_lev_x_"))%>% select(-one_of(colnames(Data_Predict_2022b5x)))


Data_predict_20225x = sqldf(
  "
  select a.*,b.* from
  Data_Predict_2022a5x a,
  variables_to_keep_2022_nolag5x b
  on b.playerid = a.lag_playerid
  and b.Season = a.lag_Season
  "
) %>% select(-lag_playerid,lag_Season) %>%
  filter(Season == 2021) %>% 
  select(one_of(variables_to_keep_2022),starts_with("Team_lev_x_"))

6.3 Create Predictions for Model

6.3.1 Run Projections on Players who Played in 2021

This is the raw prediction score per IP for each pitcher


pitching_predictions5x = as.data.frame(predict(xgb_model_2022,Data_predict_20225x))

names(pitching_predictions5x) = c("Predict_Score")

Data_predict_2022_w_Pitching_Predictions5x = cbind(Data_predict_2022,pitching_predictions5x) %>% select(playerid,Predict_Score)

head(Data_predict_2022_w_Pitching_Predictions5x)
NA

6.3.2 Load in Latest 2022 Projections for Innings Pitched

Downloaded from FanGraphs here.

Latest_2022_pitchingdata_FP = read_csv("FanGraph_Fantasy_Baseball_Pitching.csv")
Rows: 817 Columns: 27
-- Column specification -------------------------------------------------------------------------------------------
Delimiter: ","
chr  (3): Name, Team, playerid
dbl (24): GS, G, IP, W, L, QS, SV, HLD, H, ER, HR, SO, BB, WHIP, K/9, BB/9, ERA, FIP, WAR, RA9-WAR, ADP, InterS...

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Latest_2022_pitchingdata_FP
NA

As you can see from the chart below there aren’t many elite pitchers in the 87+ Predict score range.



Pitching_Data_NonAdj_Projections5x = sqldf(
  "
  select a.*,b.Predict_Score
  from Latest_2022_pitchingdata_FP a 
  left join 
  Data_predict_2022_w_Pitching_Predictions5x b
  on a.playerid = b.playerid
  "
) %>% filter(ADP<370 | is.na(Predict_Score)==F)


Pitching_Data_Adj_Projections5x =
Pitching_Data_NonAdj_Projections5x %>% 
  mutate(
    Avg_IP = 60,
    AdjPredict_Score_raw = ifelse(is.na(Predict_Score),NA,Predict_Score*(IP/Avg_IP)),
    max_predscore= max(AdjPredict_Score_raw,na.rm = T),
    AdjPredict_Score = ifelse (is.na(AdjPredict_Score_raw),NA,AdjPredict_Score_raw *100/max_predscore),
    WAR_rank = order(order(rank(WAR,ties.method = 'average'),decreasing = TRUE)),
    AdjPredict_Score_Rank = order(order(rank(AdjPredict_Score,ties.method = 'average'),decreasing = TRUE))-sum(is.na(AdjPredict_Score)),
        Ranks_Above_ADP = ADP - AdjPredict_Score_Rank
  ) %>% select (Name,ADP,WAR, WAR_rank,AdjPredict_Score ,AdjPredict_Score_Rank,Ranks_Above_ADP)


  

ggplot2::qplot(Pitching_Data_Adj_Projections5x$AdjPredict_Score, main="Predictions") + geom_histogram(colour="black", fill="grey") + theme_bw()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


7 2022 Projections Full

7.1 Table of Pitching Projections (Players who Didn’t Play in 2021 - Recieve an NA)

AdjPredict_Score are normalized to 100


tableexport =
Pitching_Data_Adj_Projections5x %>%
  arrange (ADP,WAR) %>% 
  kbl() %>% 
 kable_material(c("striped", "hover","condensed","responsive"),full_width = F,fixed_thead = T)

save_kable(tableexport,file = "Pitching5x5.html")

#tableexport

This is a better formatted Table




ft_dt <- Pitching_Data_Adj_Projections5x[1:nrow(Pitching_Data_Adj_Projections5x), 1:ncol(Pitching_Data_Adj_Projections5x)] %>% 
  filter(AdjPredict_Score_Rank>0)%>%  arrange((AdjPredict_Score_Rank))

ft_dt$ADP <- color_tile("white", "red")(ft_dt$ADP)

ft_dt$WAR <- color_bar("lightblue")(ft_dt$WAR)

ft_dt$AdjPredict_Score<- color_bar("lightblue")(ft_dt$AdjPredict_Score)

ft_dt$WAR_Rank <- color_tile("green","orange")(ft_dt$WAR_rank)

ft_dt$Predict_Rank <- color_tile("green","orange")(ft_dt$AdjPredict_Score_Rank) 


ft_dt$Ranks_Above_ADP <- 
  ifelse(
  ft_dt$Ranks_Above_ADP < 0,
  cell_spec(round(ft_dt$Ranks_Above_ADP,2), color = "red", italic = T),
  cell_spec(round(ft_dt$Ranks_Above_ADP,2), color = "green", italic = T)
)


ft_dt2 <- ft_dt[c("Name", "ADP", "WAR", "AdjPredict_Score", "WAR_Rank","Predict_Rank","Ranks_Above_ADP")]



table_export = 
kbl(ft_dt2, escape = F) %>% 
 kable_material(c("striped", "hover","condensed","responsive"),full_width = F,fixed_thead = T) %>%   column_spec(6, width = "3cm") %>%
  add_header_above(c(" ", "Scores" = 3, "Ranks" = 2," "))
save_kable(table_export,file = "Pitching5x5_updated.html")
  
table_export  
Scores
Ranks
Name ADP WAR AdjPredict_Score WAR_Rank Predict_Rank Ranks_Above_ADP
Dylan Cease 81.2 3.3 100.0 20 1 80.2
Eduardo Rodriguez 150.1 3.5 99.8 15 2 148.1
Shane Bieber 31.2 4.4 93.2 4 3 28.2
Aaron Nola 41.0 4.2 89.8 7 4 37
Nick Pivetta 350.8 1.7 88.7 86 5 345.8
Tarik Skubal 197.6 2.0 88.0 61 6 191.6
Germán Márquez 261.7 2.9 88.0 29 7 254.7
Tyler Mahle 132.2 2.6 87.8 39 8 124.2
Brady Singer 492.4 2.2 87.8 53 9 483.4
Jon Gray 236.9 2.3 87.5 51 10 226.9
Shane McClanahan 109.6 2.6 86.3 40 11 98.6
Brad Keller 582.1 1.6 85.3 91 12 570.1
Luis Castillo 103.2 3.3 84.9 21 13 90.2
Zac Gallen 154.3 2.3 84.8 50 14 140.3
Josiah Gray 290.7 1.6 84.3 89 15 275.7
José Quintana 575.5 1.0 84.2 143 16 559.5
Patrick Corbin 456.0 1.6 82.6 95 17 439
Alek Manoah 93.0 2.8 82.0 33 18 75
Gerrit Cole 7.1 5.0 81.9 3 19 -11.9
Yusei Kikuchi 301.3 1.7 81.1 83 20 281.3
Sean Manaea 141.4 2.8 80.6 31 21 120.4
Logan Gilbert 161.8 2.2 80.0 52 22 139.8
Lucas Giolito 44.2 3.7 79.5 10 23 21.2
Logan Webb 69.9 3.7 79.1 11 24 45.9
Andrew Heaney 299.4 1.4 78.9 102 25 274.4
Blake Snell 113.4 2.4 78.6 49 26 87.4
Frankie Montas 92.8 3.3 78.6 19 27 65.8
Corbin Burnes 10.8 5.3 77.9 1 28 -17.2
Freddy Peralta 54.0 3.6 77.9 12 29 25
Kyle Hendricks 279.8 1.7 77.6 79 30 249.8
Jesús Luzardo 289.5 0.8 76.2 158 31 258.5
Walker Buehler 18.6 3.8 75.6 8 32 -13.4
Framber Valdez 149.6 3.1 75.0 26 33 116.6
Aaron Civale 269.5 1.5 74.8 96 34 235.5
José Berríos 77.5 3.2 74.6 23 35 42.5
Steven Matz 249.2 1.9 74.5 67 36 213.2
Luis Garcia 144.7 2.0 74.4 64 37 107.7
Robbie Ray 49.9 3.0 74.2 27 38 11.9
Kevin Gausman 71.4 3.2 73.8 24 39 32.4
Ian Anderson 151.1 2.1 73.5 56 40 111.1
Sonny Gray 168.2 2.7 73.3 38 41 127.2
Yu Darvish 100.3 2.7 72.2 34 42 58.3
Kyle Gibson 393.9 1.9 72.1 71 43 350.9
Cole Irvin 532.2 1.5 71.8 97 44 488.2
Nathan Eovaldi 135.1 3.6 71.6 13 45 90.1
Trevor Rogers 92.9 3.1 71.2 25 46 46.9
Mitch Keller 531.5 1.2 71.0 123 47 484.5
Triston McKenzie 229.0 1.6 70.7 93 48 181
Julio Urías 34.7 3.5 70.5 14 49 -14.3
Jordan Montgomery 208.7 2.7 70.4 35 50 158.7
Patrick Sandoval 196.0 2.4 69.9 46 51 145
Zac Lowther 999.0 0.8 69.2 159 52 947
Antonio Senzatela 585.0 2.0 68.9 62 53 532
JT Brubaker 537.7 1.4 68.3 105 54 483.7
Kris Bubic 536.0 1.1 68.1 125 55 481
Bruce Zimmermann 600.8 1.3 67.9 114 56 544.8
Shohei Ohtani 9.2 2.8 67.9 32 57 -47.8
Dylan Bundy 498.5 1.3 67.7 109 58 440.5
Luis Patiño 290.7 1.5 67.5 99 59 231.7
Mike Minor 532.2 2.0 67.5 63 60 472.2
Kyle Freeland 574.4 1.5 67.4 98 61 513.4
Max Fried 69.8 3.5 67.3 16 62 7.8
Zach Plesac 346.6 1.5 66.8 100 63 283.6
Dane Dunning 484.6 1.9 66.7 72 64 420.6
Hyun Jin Ryu 207.3 2.4 66.6 45 65 142.3
Alex Cobb 242.9 1.9 66.4 70 66 176.9
Brandon Woodruff 21.2 4.2 66.3 6 67 -45.8
Tanner Houck 206.4 2.4 66.2 47 68 138.4
Joe Ryan 221.5 2.2 66.1 54 69 152.5
Jordan Lyles 578.9 0.6 65.8 196 70 508.9
Joe Musgrove 70.6 3.2 65.4 22 71 -0.4
Max Scherzer 19.7 3.8 65.1 9 72 -52.3
Dallas Keuchel 567.3 1.2 64.9 119 73 494.3
Alex Wood 219.9 1.7 64.8 78 74 145.9
Carlos Hernández 412.5 1.2 64.4 121 75 337.5
Bailey Ober 261.3 2.2 64.4 55 76 185.3
Chris Flexen 413.6 1.9 64.0 73 77 336.6
Michael Kopech 161.1 2.5 63.5 43 78 83.1
Matt Manning 543.4 1.1 63.1 124 79 464.4
Sandy Alcantara 43.4 3.3 63.0 18 80 -36.6
Huascar Ynoa 245.8 1.8 63.0 76 81 164.8
Zack Wheeler 31.0 4.4 62.9 5 82 -51
Cristian Javier 284.0 1.2 62.8 118 83 201
Charlie Morton 94.3 2.9 62.5 30 84 10.3
Austin Gomber 551.9 1.2 62.4 116 85 466.9
Reid Detmers 427.4 1.1 62.4 134 86 341.4
Lance Lynn 70.0 3.4 62.0 17 87 -17
Keegan Akin 600.0 1.0 61.8 139 88 512
Chris Bassitt 128.5 2.4 61.7 48 89 39.5
Ranger Suárez 179.3 2.4 61.2 44 90 89.3
Carlos Rodón 110.2 2.9 61.2 28 91 19.2
Merrill Kelly 526.0 1.7 60.8 81 92 434
Spencer Howard 575.8 0.9 60.6 150 93 482.8
Vladimir Gutierrez 585.4 0.2 59.4 322 94 491.4
Corey Kluber 324.4 1.1 58.9 130 95 229.4
Cal Quantrill 267.3 1.7 58.9 82 96 171.3
Carlos Carrasco 284.6 1.3 58.6 110 97 187.6
Casey Mize 279.4 1.7 58.5 84 98 181.4
Zach Davies 595.9 0.2 58.4 343 99 496.9
James Kaprielian 448.3 1.1 58.4 126 100 348.3
Aaron Ashby 264.1 1.5 58.2 101 101 163.1
Ryan Yarbrough 541.5 1.4 58.0 106 102 439.5
Marco Gonzales 319.6 1.7 57.9 80 103 216.6
Erick Fedde 589.7 0.6 57.8 188 104 485.7
Pablo López 140.6 2.7 57.6 36 105 35.6
Tony Gonsolin 300.6 0.9 57.4 148 106 194.6
J.A. Happ 597.0 0.6 57.4 199 107 490
Elieser Hernandez 339.6 1.0 57.3 145 108 231.6
Eric Lauer 306.3 1.4 57.3 103 109 197.3
José Suarez 535.1 1.3 57.2 113 110 425.1
Taylor Hearn 591.2 0.7 56.8 183 111 480.2
Wade Miley 502.0 1.2 56.5 120 112 390
Marcus Stroman 187.5 2.6 56.5 41 113 74.5
Alec Mills 594.1 0.6 55.6 211 114 480.1
Chris Sale 85.9 1.9 55.4 66 115 -29.1
Anthony DeSclafani 212.6 1.8 54.9 77 116 96.6
Jack Flaherty 115.4 1.6 54.8 87 117 -1.6
Jameson Taillon 299.6 2.1 54.7 59 118 181.6
Clayton Kershaw 149.4 2.5 54.2 42 119 30.4
José Urquidy 212.1 2.0 54.1 65 120 92.1
Taijuan Walker 414.5 0.6 53.7 185 121 293.5
Madison Bumgarner 507.1 1.1 53.7 131 122 385.1
Michael Lorenzen 543.3 0.6 53.4 203 123 420.3
Zach Eflin 485.0 1.9 53.3 69 124 361
Jacob deGrom 19.7 5.1 53.0 2 125 -105.3
Tylor Megill 334.2 1.1 52.7 129 126 208.2
Bryse Wilson 600.5 0.5 52.3 223 127 473.5
Chris Paddack 436.8 1.2 52.2 117 128 308.8
Wil Crowe 600.8 0.1 51.9 379 129 471.8
Drew Rasmussen 282.4 1.6 51.5 88 130 152.4
Stephen Strasburg 294.0 1.3 51.5 111 131 163
Drew Smyly 577.4 0.6 51.3 208 132 445.4
Daniel Bard 579.3 0.4 51.1 266 133 446.3
Adrian Houser 465.0 1.4 50.5 104 134 331
Luke Weaver 514.4 1.1 50.3 128 135 379.4
Martín Pérez 598.8 0.8 49.4 168 136 462.8
John Means 217.1 2.7 49.2 37 137 80.1
Daniel Lynch 559.5 0.7 48.9 174 138 421.5
Michael Wacha 549.7 0.9 48.9 149 139 410.7
A.J. Alexy 596.2 0.6 48.5 186 140 456.2
Edward Cabrera 523.6 0.6 47.9 202 141 382.6
Glenn Otto 582.9 0.9 47.2 152 142 440.9
José Alvarado 587.6 0.5 47.2 237 143 444.6
Adam Wainwright 194.7 2.1 46.9 58 144 50.7
Zack Greinke 312.8 1.6 46.7 90 145 167.8
Adam Ottavino 598.1 0.2 46.3 328 146 452.1
James Karinchak 468.6 0.7 45.3 171 147 321.6
Nestor Cortes 333.1 1.4 45.3 107 148 185.1
Randy Dobnak 600.4 0.8 45.3 162 149 451.4
Aroldis Chapman 83.3 0.9 45.2 151 150 -66.7
Rich Hill 477.4 0.7 44.9 173 151 326.4
Garrett Whitlock 239.2 1.2 44.7 122 152 87.2
Johnny Cueto 585.8 0.3 44.7 283 153 432.8
Justus Sheffield 599.5 0.4 44.0 269 154 445.5
Michael Pineda 465.2 1.4 43.9 108 155 310.2
Josh Fleming 600.8 0.5 43.8 222 156 444.8
Trevor May 565.3 0.4 43.6 239 157 408.3
Kyle Finnegan 351.5 0.2 43.4 336 158 193.5
Matt Barnes 245.2 0.8 43.2 164 159 86.2
Tyler Anderson 564.5 0.7 43.0 172 160 404.5
Paul Sewald 289.0 0.5 43.0 234 161 128
Jake Diekman 585.9 0.5 42.9 232 162 423.9
Héctor Neris 493.9 0.4 42.9 242 163 330.9
Tanner Rainey 371.9 0.2 42.8 325 164 207.9
Gregory Soto 194.3 0.5 42.6 228 165 29.3
Miles Mikolas 512.8 1.6 42.5 94 166 346.8
Amir Garrett 586.6 0.2 42.5 333 167 419.6
Phil Maton 600.4 0.4 42.2 248 168 432.4
Dillon Peters 600.7 0.0 42.1 421 169 431.7
Scott Barlow 159.3 0.9 41.8 155 170 -10.7
Carlos Estévez 528.6 0.2 41.7 337 171 357.6
David Peterson 595.2 0.7 41.6 175 172 423.2
Lance McCullers Jr. 260.0 1.7 41.4 85 173 87
Pete Fairbanks 517.3 0.6 40.9 194 174 343.3
Lucas Sims 243.6 0.7 40.9 180 175 68.6
Justin Steele 598.7 0.4 40.8 245 176 422.7
Paolo Espino 599.2 0.2 40.7 327 177 422.2
Edwin Díaz 63.5 1.1 40.6 135 178 -114.5
Aaron Bummer 595.4 1.0 40.6 137 179 416.4
Chad Green 403.2 1.0 40.4 147 180 223.2
Seth Lugo 598.1 0.5 40.3 236 181 417.1
Chris Stratton 519.4 0.3 40.2 284 182 337.4
Tanner Scott 599.0 0.7 40.0 182 183 416
David Price 581.5 0.5 39.4 216 184 397.5
Zach Thompson 546.8 1.0 39.3 146 185 361.8
Mike Mayers 594.7 0.5 39.2 221 186 408.7
Griffin Canning 594.2 0.5 39.1 213 187 407.2
Sam Hentges 999.0 0.3 38.9 306 188 811
Devin Williams 287.1 1.1 38.9 133 189 98.1
Heath Hembree 600.8 0.2 38.7 338 190 410.8
Mychal Givens 549.7 0.1 38.5 419 191 358.7
Alex Colomé 378.0 0.1 38.5 387 192 186
Domingo Germán 494.6 1.1 38.4 132 193 301.6
Raisel Iglesias 47.6 1.0 38.1 138 194 -146.4
Brad Boxberger 588.3 0.2 38.1 331 195 393.3
Brent Suter 587.6 0.4 38.1 268 196 391.6
Giovanny Gallegos 113.5 1.0 38.1 140 197 -83.5
Daniel Norris 999.0 0.1 38.0 398 198 801
Hansel Robles 596.0 -0.1 37.9 505 199 397
Austin Voth 600.4 0.0 37.8 420 200 400.4
Jaime Barría 999.0 0.5 37.7 238 201 798
Jordan Romano 88.3 0.8 37.7 160 202 -113.7
Kyle Funkhouser 999.0 0.1 37.6 386 203 796
Will Smith 120.8 0.5 37.4 227 204 -83.2
Taylor Rogers 172.5 1.1 37.3 127 205 -32.5
Yimi García 581.1 0.2 37.3 352 206 375.1
Tyler Wells 454.6 0.9 37.2 153 207 247.6
Camilo Doval 159.8 0.5 37.1 229 208 -48.2
Justin Dunn 577.6 0.6 37.1 198 209 368.6
Joe Jiménez 999.0 0.2 36.9 329 210 789
Garrett Richards 598.7 0.4 36.8 272 211 387.7
Hirokazu Sawamura 589.6 0.0 36.6 443 212 377.6
Jake Odorizzi 533.5 0.8 36.6 163 213 320.5
Ryne Stanek 598.8 0.2 36.6 330 214 384.8
J.C. Mejía 999.0 0.1 36.4 389 215 784
Bryan Shaw 597.8 -0.1 36.3 507 216 381.8
Luis Gil 525.9 0.8 36.2 165 217 308.9
Rowan Wick 302.0 0.4 36.2 249 218 84
Rafael Dolis 999.0 0.0 36.1 434 219 780
Tyler Alexander 586.8 0.8 36.1 167 220 366.8
Liam Hendriks 32.0 1.6 36.1 92 221 -189
J.B. Wendelken 600.1 0.2 36.0 341 222 378.1
Brad Hand 520.0 0.2 35.9 351 223 297
Jeurys Familia 597.5 0.3 35.8 305 224 373.5
Trevor Stephan 999.0 0.1 35.8 368 225 774
Michael Fulmer 347.0 0.6 35.8 192 226 121
Griffin Jax 600.8 0.1 35.7 363 227 373.8
Ryan Pressly 64.3 1.0 35.5 136 228 -163.7
Jackson Kowar 594.6 0.5 35.5 233 229 365.6
José Ureña 999.0 0.0 35.3 467 230 769
Blake Treinen 147.6 0.8 35.3 166 231 -83.4
José Cisnero 999.0 0.3 35.2 281 232 767
Cole Sulser 427.2 0.7 35.1 177 233 194.2
Joely Rodríguez 600.9 0.5 35.1 214 234 366.9
Lou Trivino 237.2 0.2 35.1 316 235 2.2
Jeff Hoffman 999.0 0.0 35.0 426 236 763
Craig Kimbrel 164.4 0.7 34.9 176 237 -72.6
Jake Brentz 600.8 0.3 34.8 301 238 362.8
Diego Castillo 409.2 0.6 34.7 191 239 170.2
Caleb Smith 592.3 0.0 34.6 441 240 352.3
Josh Sborz 600.9 0.3 34.5 290 241 359.9
Chad Kuhl 600.7 0.2 34.4 349 242 358.7
Dean Kremer 999.0 0.6 34.4 210 243 756
Keegan Thompson 999.0 0.0 34.4 429 244 755
Matt Wisler 599.6 0.4 34.3 270 245 354.6
Ryan Helsley 600.2 0.1 34.3 380 246 354.2
Anthony Banda 999.0 -0.1 34.2 514 247 752
Andrew Kittredge 257.6 0.8 34.1 169 248 9.6
Garrett Crochet 549.8 0.9 34.0 154 249 300.8
Tony Santillan 600.3 0.4 34.0 263 250 350.3
Eli Morgan 600.5 0.2 33.9 346 251 349.5
Kenley Jansen 90.0 0.6 33.7 187 252 -162
Andrew Wantz 999.0 0.4 33.7 262 253 746
Spencer Patton 599.0 0.1 33.4 410 254 345
Josh Hader 30.5 1.3 33.3 115 255 -224.5
Ross Stripling 583.7 0.4 33.2 256 256 327.7
Trevor Richards 600.5 0.4 33.1 261 257 343.5
Génesis Cabrera 596.6 0.6 32.9 205 258 338.6
Robert Stephenson 590.5 0.0 32.8 466 259 331.5
Vince Velasquez 599.4 0.3 32.8 308 260 339.4
Matthew Boyd 591.0 1.0 32.4 144 261 330
Josh Staumont 505.8 0.4 32.4 244 262 243.8
Adbert Alzolay 444.0 0.7 32.4 179 263 181
Corey Knebel 151.3 0.6 32.4 204 264 -112.7
Anthony Misiewicz 999.0 0.3 32.3 295 265 734
Dylan Floro 221.3 0.4 32.3 251 266 -44.7
A.J. Minter 596.5 0.5 32.3 215 267 329.5
Dinelson Lamet 400.5 0.9 32.2 157 268 132.5
Brusdar Graterol 567.4 0.7 32.0 178 269 298.4
Sam Howard 999.0 0.3 32.0 310 270 729
Sam Coonrod 600.9 0.4 31.8 271 271 329.9
Jonathan Loáisiga 393.7 1.0 31.8 142 272 121.7
Clay Holmes 600.3 0.6 31.7 201 273 327.3
Mark Melancon 131.8 0.4 31.6 247 274 -142.2
Jakob Junis 600.1 0.2 31.6 332 275 325.1
Archie Bradley 600.7 0.2 31.5 348 276 324.7
Tyler Duffey 577.8 0.4 31.3 260 277 300.8
Brett Anderson 600.9 0.6 31.3 190 278 322.9
Anthony Bender 412.8 0.7 31.1 184 279 133.8
Tim Mayza 599.3 0.6 31.1 197 280 319.3
Wandy Peralta 999.0 0.2 30.9 340 281 718
Ian Kennedy 409.8 -0.1 30.9 512 282 127.8
David Bednar 190.8 0.9 30.8 156 283 -92.2
Caleb Thielbar 999.0 0.6 30.8 195 284 715
Emmanuel Clase 57.5 1.3 30.7 112 285 -227.5
Jorge Alcala 459.0 0.6 30.7 209 286 173
Collin McHugh 545.5 0.6 30.4 189 287 258.5
Trevor Williams 601.0 0.2 30.4 326 288 313
Emilio Pagán 551.7 0.1 30.4 364 289 262.7
Patrick Murphy 999.0 0.4 30.2 254 290 709
J.P. Feyereisen 597.6 0.0 30.1 487 291 306.6
Kendall Graveman 531.0 0.4 30.1 240 292 239
Jorge López 599.9 0.4 30.0 265 293 306.9
Tyler Rogers 534.3 0.5 30.0 230 294 240.3
Kyle Zimmer 999.0 -0.2 29.7 528 295 704
Max Kranick 999.0 0.3 29.6 297 296 703
Joe Ross 599.6 0.7 29.4 170 297 302.6
Sergio Romo 600.5 0.0 29.4 450 298 302.5
Brooks Raley 600.9 0.3 29.3 296 299 301.9
Daniel Hudson 550.2 0.5 29.1 225 300 250.2
Pierce Johnson 433.3 0.6 29.1 200 301 132.3
Ryan Hendrix 999.0 0.0 29.1 468 302 697
Tim Hill 999.0 0.3 29.0 293 303 696
Nick Wittgren 999.0 0.1 29.0 374 304 695
Mike Foltynewicz 600.4 -0.1 28.9 511 305 295.4
Cionel Pérez 999.0 0.5 28.8 235 306 693
Lucas Gilbreath 596.4 0.0 28.7 471 307 289.4
Kwang Hyun Kim 593.0 0.3 28.6 303 308 285
Taylor Widener 600.5 0.0 28.4 424 309 291.5
Alex Vesia 600.0 0.3 28.4 289 310 290
Sean Doolittle 598.5 0.1 28.3 402 311 287.5
Rafael Montero 999.0 0.1 28.0 394 312 687
Connor Brogdon 600.4 0.4 27.9 252 313 287.4
Jake Woodford 599.5 0.0 27.9 460 314 285.5
John King 999.0 0.5 27.7 219 315 684
Josh Taylor 999.0 0.5 27.7 217 316 683
José Quijada 999.0 0.5 27.6 212 317 682
Josh Tomlin 999.0 -0.3 27.4 532 318 681
Sean Newcomb 600.1 0.1 27.3 400 319 281.1
Tyler Kinley 999.0 0.0 27.3 433 320 679
Justin Wilson 999.0 0.1 27.0 377 321 678
Luke Jackson 598.1 0.4 27.0 267 322 276.1
Bailey Falter 599.8 0.5 27.0 218 323 276.8
Jake McGee 251.1 0.4 27.0 241 324 -72.9
Darwinzon Hernandez 999.0 0.2 27.0 324 325 674
Tucker Davidson 584.9 0.6 26.9 193 326 258.9
Duane Underwood Jr. 999.0 0.1 26.8 417 327 672
Ryan Tepera 581.3 0.4 26.7 246 328 253.3
Matt Harvey 999.0 0.3 26.7 286 329 670
Miguel Castro 600.9 0.1 26.6 372 330 270.9
Johan Oviedo 999.0 0.2 26.3 356 331 668
Austin Warren 999.0 0.5 26.2 231 332 667
Joe Smith 999.0 0.1 26.1 361 333 666
Bryan Garcia 999.0 -0.2 26.1 530 334 665
Nick Sandlin 599.8 0.6 26.1 207 335 264.8
Drew Steckenrider 401.5 0.3 26.1 294 336 65.5
Alexander Wells 999.0 0.4 25.9 276 337 662
Alex Lange 999.0 0.1 25.8 366 338 661
Sean Reid-Foley 999.0 0.0 25.7 438 339 660
Jake Cousins 600.1 0.5 25.6 226 340 260.1
Zach Pop 999.0 0.2 25.5 345 341 658
Logan Allen 600.8 0.1 25.3 418 342 258.8
Sammy Long 596.1 0.3 25.2 307 343 253.1
Demarcus Evans 600.6 0.0 25.2 472 344 256.6
Zack Littell 599.3 0.1 25.2 367 345 254.3
JT Chargois 600.3 0.3 25.2 279 346 254.3
Reynaldo López 557.4 0.4 25.2 257 347 210.4
Jarlín García 600.8 0.1 25.1 411 348 252.8
Dillon Tate 600.9 0.4 25.0 264 349 251.9
Matt Shoemaker 999.0 0.0 25.0 457 350 649
Steven Brault 600.9 0.0 24.8 432 351 249.9
Yency Almonte 999.0 -0.2 24.8 524 352 647
Brett Martin 999.0 0.4 24.7 275 353 646
Craig Stammen 599.4 0.4 24.7 259 354 245.4
Anthony Bass 600.9 0.1 24.6 412 355 245.9
Albert Abreu 999.0 -0.1 24.6 496 356 643
Jeffrey Springs 999.0 0.3 24.5 287 357 642
Austin Davis 999.0 0.2 24.5 320 358 641
Phil Bickford 600.9 0.4 24.5 258 359 241.9
Tyler Clippard 600.8 -0.2 24.5 529 360 240.8
Blake Taylor 999.0 0.2 24.4 353 361 638
Greg Holland 600.2 0.0 24.4 473 362 238.2
Dominic Leone 600.8 0.1 24.2 371 363 237.8
Alex Reyes 395.7 0.2 24.0 335 364 31.7
Ryan Weathers 600.8 0.3 24.0 300 365 235.8
Austin Adams 600.0 0.4 24.0 273 366 234
Josh Rogers 600.8 -0.1 23.8 506 367 233.8
Art Warren 482.3 0.8 23.7 161 368 114.3
Yusmeiro Petit 601.0 -0.3 23.6 531 369 232
Sean Poppen 999.0 0.2 23.6 319 370 629
Tyler Gilbert 599.9 0.4 23.5 255 371 228.9
Adam Morgan 999.0 0.1 23.4 391 372 627
Michael King 600.5 0.5 23.4 220 373 227.5
Chris Martin 600.3 0.4 23.4 277 374 226.3
Dennis Santana 999.0 0.0 23.1 442 375 624
Jacob Webb 999.0 0.0 23.1 447 376 623
Joe Kelly 600.4 0.4 23.1 243 377 223.4
Wander Suero 999.0 0.2 23.0 317 378 621
Chi Chi González 999.0 0.1 22.9 396 379 620
Brad Brach 999.0 0.0 22.8 436 380 619
Kolby Allard 601.0 0.2 22.7 312 381 220
Brandon Kintzler 999.0 0.0 22.6 464 382 617
Trevor Bauer 208.4 1.0 22.5 141 383 -174.6
Richard Rodríguez 556.5 0.1 22.5 414 384 172.5
Touki Toussaint 598.6 0.1 22.3 369 385 213.6
Ralph Garza Jr. 999.0 -0.1 22.1 501 386 613
Jhoulys Chacín 999.0 -0.4 22.0 533 387 612
Tyler Matzek 573.2 0.7 22.0 181 388 185.2
Ryan Thompson 999.0 0.3 22.0 278 389 610
Luis Cessa 588.8 0.3 21.8 299 390 198.8
Andrew Miller 999.0 -0.1 21.8 510 391 608
Ross Detwiler 999.0 0.0 21.7 474 392 607
Taylor Clarke 999.0 0.0 21.5 437 393 606
Deolis Guerra 599.9 0.2 21.3 342 394 205.9
Ben Bowden 999.0 0.1 21.3 395 395 604
Joe Barlow 216.1 0.2 21.1 344 396 -179.9
Andrew Chafin 587.9 0.6 20.8 206 397 190.9
Nick Mears 999.0 0.0 20.8 480 398 601
Kyle Muller 590.8 0.3 20.8 302 399 191.8
Packy Naughton 999.0 0.3 20.7 292 400 599
Drew Smith 999.0 0.1 20.7 360 401 598
Michael Rucker 999.0 0.0 20.4 427 402 597
Derek Holland 999.0 -0.1 20.3 498 403 596
Andres Machado 999.0 -0.1 20.2 513 404 595
Trevor Cahill 999.0 0.2 20.2 313 405 594
Cody Ponce 999.0 0.2 20.2 354 406 593
Aaron Loup 594.4 0.4 20.0 274 407 187.4
Carlos Martínez 596.0 0.0 19.9 477 408 188
Jake Arrieta 999.0 0.0 19.8 481 409 590
Junior Guerra 999.0 0.0 19.8 490 410 589
Bryan Abreu 999.0 0.1 19.7 392 411 588
Charlie Barnes 999.0 0.2 19.7 339 412 587
Jay Jackson 999.0 0.2 19.7 347 413 586
Brandon Bielak 999.0 0.1 19.6 362 414 585
Nabil Crismatt 999.0 0.1 19.5 403 415 584
Wily Peralta 600.8 0.1 19.3 388 416 184.8
Tony Watson 999.0 0.0 19.2 479 417 582
Steve Cishek 595.3 0.0 19.1 439 418 177.3
Jesse Chavez 999.0 0.1 18.6 408 419 580
Mitch White 562.7 0.3 18.6 288 420 142.7
Steven Okert 999.0 0.3 18.6 304 421 578
Adam Cimber 600.6 0.3 18.6 282 422 178.6
Alex Claudio 999.0 0.0 18.5 463 423 576
Joe Mantiply 999.0 0.2 18.3 334 424 575
Trent Thornton 999.0 0.1 18.2 401 425 574
Buck Farmer 999.0 -0.2 18.2 519 426 573
Matt Foster 999.0 0.1 18.1 373 427 572
Domingo Tapia 999.0 0.0 18.0 492 428 571
Jharel Cotton 600.8 0.1 18.0 376 429 171.8
Ryan Borucki 999.0 0.2 17.8 355 430 569
Hunter Strickland 599.9 0.0 17.6 423 431 168.9
Paul Blackburn 999.0 0.2 17.4 315 432 567
Erik Swanson 600.1 0.3 17.1 291 433 167.1
Jordan Holloway 999.0 -0.1 17.1 502 434 565
Ervin Santana 999.0 -0.5 17.0 534 435 564
Blake Parker 999.0 0.1 17.0 393 436 563
Jacob Barnes 999.0 0.1 16.8 381 437 562
Brandon Workman 999.0 -0.2 16.7 521 438 561
Mason Thompson 999.0 0.0 16.7 448 439 560
Paul Campbell 999.0 -0.1 16.6 495 440 559
Michael Feliz 999.0 0.0 16.4 425 441 558
Tyler Zuber 999.0 0.0 16.3 462 442 557
Noé Ramirez 600.9 0.1 16.2 390 443 157.9
Justin Garza 999.0 -0.2 16.2 523 444 555
Yohan Ramirez 599.8 0.1 16.1 413 445 154.8
Miguel Sánchez 999.0 0.0 16.0 484 446 553
Kyle McGowin 999.0 0.2 15.9 321 447 552
Héctor Santiago 999.0 -0.2 15.6 525 448 551
Ryan Burr 999.0 0.0 15.5 440 449 550
Humberto Castellanos 999.0 0.1 15.5 370 450 549
JD Hammer 999.0 0.0 15.4 483 451 548
Phillips Valdez 999.0 -0.1 15.1 493 452 547
Kodi Whitley 999.0 0.2 14.4 350 453 546
Junior Fernández 999.0 0.1 14.3 415 454 545
Danny Coulombe 999.0 0.1 14.3 382 455 544
Drew Pomeranz 582.3 0.1 14.2 383 456 126.3
Rex Brothers 999.0 0.0 14.1 461 457 542
Juan Minaya 999.0 0.1 14.0 397 458 541
Anthony Castro 999.0 0.1 13.6 375 459 540
Tommy Nance 999.0 0.1 13.5 385 460 539
T.J. McFarland 999.0 0.1 13.3 378 461 538
Joel Payamps 999.0 0.3 13.2 285 462 537
Richard Bleier 600.7 0.5 13.1 224 463 137.7
Sean Guenther 999.0 0.2 13.0 323 464 535
Nick Neidert 999.0 -0.2 13.0 517 465 534
Ryne Harper 999.0 0.1 13.0 416 466 533
Anthony Kay 999.0 0.0 12.8 422 467 532
Hoby Milner 999.0 0.1 12.8 405 468 531
Matt Moore 999.0 -0.1 12.8 503 469 530
Dan Winkler 999.0 -0.1 12.7 515 470 529
Humberto Mejía 999.0 0.0 12.7 453 471 528
Matt Peacock 999.0 0.0 12.5 488 472 527
Brett de Geus 999.0 0.0 12.4 475 473 526
Aaron Sanchez 600.6 0.0 12.4 451 474 126.6
José Ruiz 999.0 0.2 12.4 357 475 524
Wade LeBlanc 999.0 0.2 12.4 358 476 523
Chris Mazza 999.0 0.3 12.3 309 477 522
Shane Greene 999.0 -0.1 12.1 499 478 521
Keynan Middleton 999.0 -0.1 12.1 500 479 520
Danny Duffy 587.9 0.2 12.0 318 480 107.9
Cody Poteet 999.0 0.1 11.7 407 481 518
José Álvarez 600.2 0.2 11.7 359 482 118.2
Chase Anderson 999.0 0.0 11.6 491 483 516
Louis Head 600.5 0.3 11.6 311 484 116.5
Kohei Arihara 999.0 0.0 11.5 449 485 514
Jordan Sheffield 999.0 -0.2 11.4 520 486 513
Ashton Goudeau 999.0 -0.2 10.6 527 487 512
Victor González 999.0 0.1 10.6 365 488 511
Yennsy Díaz 999.0 -0.1 10.4 516 489 510
Braxton Garrett 999.0 0.0 10.4 428 490 509
Chasen Shreve 999.0 -0.1 10.1 509 491 508
César Valdez 600.3 0.0 10.0 431 492 108.3
Sam Selman 999.0 0.0 9.5 444 493 506
Cam Bedrosian 999.0 0.0 9.1 458 494 505
Dustin May 559.5 0.3 9.1 298 495 64.5
Sam Clay 999.0 0.0 8.7 430 496 503
Caleb Baragar 999.0 -0.2 8.7 522 497 502
Dillon Maples 999.0 0.1 8.6 409 498 501
Daniel Ponce de Leon 600.7 0.0 8.5 456 499 101.7
Erasmo Ramírez 999.0 -0.2 7.9 526 500 499
Edwin Uceta 999.0 0.0 7.9 455 501 498
Miguel Diaz 999.0 -0.1 7.8 494 502 497
Marcos Diplán 999.0 0.0 7.4 459 503 496
Enyel De Los Santos 999.0 0.1 7.3 404 504 495
Tayler Saucedo 999.0 0.1 7.1 399 505 494
Robert Gsellman 999.0 -0.2 7.1 518 506 493
Wes Benjamin 999.0 -0.1 7.0 504 507 492
Thomas Eshelman 999.0 0.0 6.8 476 508 491
Reiss Knehr 999.0 0.0 6.7 446 509 490
Jake Faria 999.0 0.0 6.7 469 510 489
Luke Farrell 999.0 0.0 6.3 489 511 488
Robert Dugger 999.0 0.0 6.2 478 512 487
Casey Sadler 600.5 0.3 5.8 280 513 87.5
Shaun Anderson 999.0 0.0 5.7 445 514 485
Drew Hutchison 999.0 -0.1 5.5 497 515 484
Sean Nolin 999.0 0.0 5.3 485 516 483
Adrian Sampson 999.0 -0.1 4.9 508 517 482
Tyler Glasnow 598.8 0.2 4.9 314 518 80.8
Shawn Armstrong 999.0 0.0 4.6 454 519 480
Kevin Ginkel 999.0 0.0 4.3 470 520 479
Carson Fulmer 999.0 0.0 4.1 482 521 478
Edgar Santana 999.0 0.0 4.1 452 522 477
Kyle Crick 999.0 0.0 4.0 465 523 476
Daniel Castano 999.0 0.0 3.7 486 524 475
Kenta Maeda 600.7 0.1 3.3 384 525 75.7
John Gant 600.8 0.0 3.3 435 526 74.8
Chris Ellis 600.8 0.1 2.5 406 527 73.8
NA
NA
NA
NA
NA
NA

---
title: "Welcome to my 2022 Projections for Pitchers 5x5"
author: "Darshan Patel"
date: "`r Sys.Date()`"
output: 
  html_notebook:
    toc: true
    toc_float: true
    number_sections: true
    theme: sandstone
    highlight: tango
    fig_caption: true
    df_print: paged
---


<html>

<p>

Projections using Hypertuned model through XGboost

</p>

<p>

All data is from [FanGraphs.](https://www.fangraphs.com/) I have no affiliation with FanGraphs, but please consider contributing to their [website](https://plus.fangraphs.com/shop/) if you found this project informative.

</p>

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_knit$set(root.dir = 'C:/Users/Admin/Documents/Learning Python Folder1/Python Essence Training/Fantasy-Baseball/Data')
options(knitr.table.format = "html") 
options(digits=2)
options(scipen = 100)
```

# Project Scope {.tabset .tabset-pills}

## Objective

This project is designed to showcase how Using a Percentile Based Worth System values Fantasy Baseball Players through a Inning Pitched (IP) weighted projection

The Categories used for prediction valuation are year-end rankings for the following metrics:

-    Wins
-   Saves
-   Strike Outs
-   ERA ( 9 \* Earned Runs per Inning Pitched)
-   WHIP (Walks and Hits per Inning Pitched)

![](IntroChart6x6.png)

------------------------------------------------------------------------

------------------------------------------------------------------------

# Processing the Data {.tabset .tabset-pills}

## Getting Data Into R

### Load Libraries

<p style="color:black;">

*First we need to load the packages that R needs to run the analysis*

</p>

```{r load library,message = FALSE,warning=FALSE}
library(sqldf) #SQL in R
library(skimr) #Summaries and useful for removing low % data
library(ggplot2) #Plotting Functions
library(plyr) #slightly deprecated data cleaning
library(dplyr) #slightly updated data cleaning
library(tidyverse) #tidyverse data cleaning universe
library(caret) #wrapper for creating, tuning and validating models
library(xgboost) #package for creating regression tree model
library(vtreat) # useful package for treating data before modeling 
library(Matrix) #creating matricies for xgboost
library(mgcv)
library(moments) #for measuring skewness
library(data.table) #alternative to dplyr we use to create lags
library(pdp) #partial dependence graphs
library(vip) #variable importance 
library(grid) #put multiple plots on one grid
library(gridExtra) #additional grid functionality
library(janitor) #one function used to clean transposed data set
library(ggpubr) #for qq plot 
library(owmr) #Removing Prefixes
library(kableExtra) # formatting HTML Tables
library(formattable) # formatting HTML Tables

```

The \# comments generally explain what additional functionality each library adds to R

### Load in Data

All data is downloaded from Fan Graphs from this [location](https://www.fangraphs.com/leaders.aspx?pos=all&stats=pit&lg=all&qual=0&type=7&season=2021&month=0&season1=2015&ind=1&team=0&rost=0&age=0&filter=&players=0&startdate=2015-01-01&enddate=2021-12-31). The data is also available on my Github [here](https://github.com/dissipation/Fantasy-Baseball). There are player level and team data sets

```{r data read-in, results= 'hide',message=FALSE}

#data read-in
pitcher_data <- read_csv("FanGraphs Leaderboard_Pitching20IP.csv")

#Team datasets
FDG_Team = read_csv("FanGraphs Leaderboard_Team.csv")


#Create a prefix for all team stats that starts with T_
FDG_Team2 <- FDG_Team %>% 
  rename_with( ~ paste0("T_", .x))
```

### Checking Team Data

`str` give information about an object, while `skim` provides a customizable summary

```{r checking team data}

#Output not shown for space
#str(FDG_Team2)

skim(FDG_Team2) %>%  
  tibble::as_tibble() #Remove this option for a normal HTML table
```

------------------------------------------------------------------------

## Understanding the Dataset

### Exploring the dataset

`skim` let's us see how the data was imported into R. Documentation can be found [here](https://cran.r-project.org/web/packages/skimr/vignettes/skimr.html)  

```{r}

#Full Dataset dimensions

skimr::skim(pitcher_data) %>% 
  tibble::as_tibble() %>%  #Remove this option for a normal HTML table
  select(skim_type,skim_variable,complete_rate) %>% 
  filter(complete_rate >0.30) #250 Variables

#skim_type - character or numeric
#skim_variable - name of variable
#complete_rate - % of data that is not missing
#filter - only keep variables that have 30% of data populated
```
***  


Additionally let's look at how variables vary by year to see if there are any discrepancies there

```{r}

#It looks like one year, there were fewer games played, and there is a clear drop off in home runs
pitcher_data_dist =
pitcher_data %>% 
 group_by(Season) %>% 
  summarize (Max_Games = max(G),
             Avg_W= mean(W)
             )

pitcher_data_dist

#Plot Win Data by Year
ggplot(pitcher_data_dist, aes(Season, Avg_W)) +
  geom_col()+
  ggtitle("Average Wins by Year")+
  theme(plot.title = element_text(hjust = 0.5,size = 22,color ="steel blue"))
```

------------------------------------------------------------------------

## Cleaning and Creating Initial Dataset for Model

What are some issues with the data?

1.  Many of Variables, such as K%, are being read in as characters

    -   Only Team and Player Name should be characters

2.  There is spotty data coverage in some of the variables (\~Variables have less than 30% Coverage)

3.  2020 Data only includes 60 games worth of data

    -   This was a season shortened due to Covid-19

4.  Team Data needs to be appended to pitcher Data by Team Name

------------------------------------------------------------------------

### Cleanly Changing all Variables that are characters to numeric.  

There are several ways to do this, we will identify the variables we want to change that are mis-identified. `parse_number` can be used to pull numbers from these variables. Additional ways to tackle this can be found [here.](https://stackoverflow.com/questions/8329059/how-to-convert-character-of-percentage-into-numeric-in-r)

```{r}

#Select Column names that are characters but not Team or Name, These should be percentages
pitcher_data_chars_to_convert <- pitcher_data %>% 
  select_if(is.character)%>% select(-Team,-Name) %>% 
  mutate_all (function(x) as.numeric(readr::parse_number(x))/100)
#Note : There are additional ways to do this, this is just one solution


#We can exclude the variables we converted and reintroduce them
pitcher_data_num <- pitcher_data %>% select(-colnames(pitcher_data_chars_to_convert))

pitcher_data2 = cbind(pitcher_data_num,pitcher_data_chars_to_convert) %>% 
  select (colnames(pitcher_data)) %>%  #preserve original order 
  dplyr::rename(flyball_perc = `FB%...50`,fastball_perc = `FB%...74`) #rename two ambiguous columns
  
skim(pitcher_data2) %>% 
  as_tibble() %>% 
  group_by(skim_type) %>% 
  count()


#Logical variables are R's best guess, in our case they are all NA's and will be removed at a later step

```

The same can be done for the Team Data that is loaded

```{r}

#Select Column names that are characters but not Team or Name, These should be percentages
FDG_Team2_chars_to_convert <- FDG_Team2 %>% 
  select_if(is.character)%>% select(-T_Team) %>% 
  mutate_all (function(x) as.numeric(readr::parse_number(x))/100)
#Keep in mind, parse number may make actual characters into numerical variables so carefully check your data before using

#We can exclude the variables we converted and reintroduce them
FDG_Team2_num <- FDG_Team2 %>% select(-colnames(FDG_Team2_chars_to_convert))

FDG_Team3 = cbind(FDG_Team2_num,FDG_Team2_chars_to_convert) %>% 
  select (colnames(FDG_Team2)) %>%  #preserve original order
dplyr::rename(T_flyball_perc = `T_FB%...45`,T_fastball_perc = `T_FB%...72`)  #rename two ambiguous columns

skim(FDG_Team3) %>% 
  as_tibble() %>% 
  group_by(skim_type) %>% 
  count()
```

------------------------------------------------------------------------

### Filtering Data with Low Coverage

I choose 30% coverage of data necessary but this can be adjusted up or down. This will also get rid of columns that are all `NA`.

```{r}

# Keep variables with enough values (Need 30% data coverage rate here)
Player_cols_to_keep =
skim(pitcher_data2) %>% 
  dplyr::select(skim_type, skim_variable, complete_rate) %>% 
  filter (complete_rate > 0.30)

#Transpose Rows to get column names as skim melts the data
Player_cols_to_keep_transpose = t(Player_cols_to_keep) 

#extract the colnames we would like to keep
Player_cols_to_keep = colnames(janitor::row_to_names(Player_cols_to_keep_transpose,row_number = 2))

#Only keep the columns designated to have over 30% of their data populated or greater
pitcher_data3 = pitcher_data2 %>% 
  select(one_of(Player_cols_to_keep)) 

```

*Repeat the process for Team Variables*

```{r}
Team_cols_to_keep =
skim(FDG_Team3) %>% 
  dplyr::select(skim_type, skim_variable, complete_rate) %>% 
  filter (complete_rate > 0.30)


#Transpose Rows to get column names as skim melts the data
Team_cols_to_keep_transpose = t(Team_cols_to_keep) 

#extract the colnames we would like to keep
Team_cols_to_keep = colnames(janitor::row_to_names(Team_cols_to_keep_transpose,row_number = 2))

#Only keep the columns designated to have over 30% of their data populated or greater
FDG_Team4 = FDG_Team3 %>% 
  select(one_of(Team_cols_to_keep)) 
```

------------------------------------------------------------------------

### Creating Variables Normalized by Year

Some Variables will need to be normalized by Innings_Pitched (IP) if they aren't a percentage already. Remaining Variables are percentages or indices so will not need to be transformed. The full data dictionary for these variables can be found on FanGraph's website [here.](https://library.fangraphs.com/pitching/complete-list-pitching/) for pitching variables and [here.](https://library.fangraphs.com/offense/offensive-statistics-list/) for hitting variables.

```{r}


pitcher_data4 = pitcher_data3 %>% 
  mutate( #create new variables based on existing variables
    W_IP = W/IP,
    L_IP =  L/IP, 
    ShO_IP = ShO/IP,
    SV_IP = SV/IP,
    BS_IP = BS/IP,
    TBF_IP = TBF/IP,
    H_IP = H/IP,
    R_IP = R/IP,
    ER_IP = ER/IP,
    HR_IP=HR/IP,
    BB_IP=BB/IP,
    IBB_IP=IBB/IP,
    HBP_IP=HBP/IP,
    WP_IP= WP/IP,
    BK_IP=BK/IP,
    SO_IP=SO/IP,
    GB_IP = GB/IP,   #Groundballs
    FB_IP =  FB/IP,  #FlyBalls
    LD_IP = LD/IP,   #LineDrives
    IFFB_IP = IFFB/IP,  #Infield Fly balls
    Balls_IP= Balls/IP,
    Strikes_IP= Strikes/IP,
    Pitches_IP= Pitches/IP,
    RS_IP= RS/IP,
    IFH_IP= IFH/IP,
    BU_IP= BU/IP,
    BUH_IP= BUH/IP,
    Pulls_IP= Pulls/IP,
    HLD_IP= HLD/IP,   
    SD_IP= SD/IP,    
    MD_IP= MD/IP,    
    Barrels_IP= Barrels/IP,
    HardHits_IP= HardHit/IP
  ) %>% select(-L,-G,-IP,-ShO,-BS,-(TBF:BK),-(GB:BUH),-Pulls,-(SD:MD),-Barrels,-HardHit,-Events)
               
#will be removed after data is lagged -FIP,-(RAR:WPA),,-(wFB:wCH),-(`ERA-`:`xFIP-`),-SIERA,-(`RA9-WAR`:`Age Rng`),-kwERA,-`wCH (pi)`:`wSL (pi)`,`K/9+`:`HR/FB%+`) 

#skim(pitcher_data4) %>% as_tibble()


```

*Repeat the process for Team Variables*

```{r}

FDG_Team5 = FDG_Team4 %>% 
  mutate( #create new variables based on existing variables
    T_H_T_PA = T_H/T_PA,
    T_x1B_T_PA = T_1B/T_PA, #note: R can't have variables start with a number
    T_x2b_T_PA = T_2B/T_PA,
    T_x3b_T_PA = T_3B/T_PA,
    T_HR_T_PA = T_HR/T_PA,
    T_R_T_PA = T_R/T_PA,
    T_RBI_T_PA = T_RBI/T_PA,
    T_BB_T_PA = T_BB/T_PA,
    T_IBB_T_PA = T_IBB/T_PA,
    T_SO_T_PA=T_SO/T_PA,
    T_HBP_T_PA=T_HBP/T_PA,
    T_SF_T_PA=T_SF/T_PA,
    T_SH_T_PA=T_SH/T_PA,
    T_GDP_T_PA= T_GDP/T_PA,#ground into double play
    T_SB_T_PA=T_SB/T_PA,
    T_CS_T_PA=T_CS/T_PA,
    T_GB_T_PA = T_GB/T_PA,   #Groundballs
    T_FB_T_PA =  T_FB/T_PA,  #FlyBalls
    T_LD_T_PA = T_LD/T_PA,   #LineDrives
    T_IFFB_T_PA = T_IFFB/T_PA,  #Infield Fly balls
    T_Pitches_T_PA= T_Pitches/T_PA,
    T_Balls_T_PA= T_Balls/T_PA,
    T_Strikes_T_PA= T_Strikes/T_PA,
    T_IFH_T_PA= T_IFH/T_PA,
    T_BU_T_PA= T_BU/T_PA,
    T_BUH_T_PA= T_BUH/T_PA,
    T_PH_T_PA= T_PH/T_PA,
    T_Barrels_T_PA= T_Barrels/T_PA,
    T_HardHits_T_PA= T_HardHit/T_PA
  ) %>% select(-(T_H:T_CS),-(T_GB:T_BUH),-T_PH,-T_Barrels,-T_HardHit,-T_Events) #Drop the old variables


#skim(FDG_Team5) %>% as_tibble()

```

------------------------------------------------------------------------

### Creating Lagged Variables

There are several ways to lag a dataset **BY GROUP**.\
\* `Dplyr` way is [here.](https://statisticsglobe.com/create-lagged-variable-by-group-in-r).\
\* The `data.table` (the method used below) is [here.](https://stackoverflow.com/questions/26291988/how-to-create-a-lag-variable-within-each-group)

```{r}
#Note we will only be lagging the player level data, as the previous year's team performance shouldn't impact current performance


#Order the dataset by lag columns
pitcher_data5 =  arrange(pitcher_data4, playerid,Season) #playerid is the Fangraph id assigned to each player

# Convert dataframe to data.table format
DT_pitcher = data.table(pitcher_data5)

#designate columns to lag - which is all of them
cols1 = colnames(pitcher_data5)
anscols = paste("lag", cols1, sep="_")
DT_pitcher[, (anscols) := data.table::shift(.SD, 1, NA, "lag"),by ='playerid', .SDcols=cols1] #Create 1 period lags by year

pitcher_data6 = as.data.frame(DT_pitcher) %>% select(-lag_playerid, -lag_Team, -lag_Season, -lag_Age,-lag_Name)

ncol(pitcher_data5) #251 - no lags
ncol(pitcher_data6) #497 - lagged data ~ (251 * 2)-5

```

------------------------------------------------------------------------

### Merging Team and Player Data

We can use either the `merge` function or the SQL functionality provided by the `sqldf` package to join the lagged player level data to the Team level data

```{r}

df_pitching_init = sqldf(
  "
  select a.*, b.*
  from pitcher_data6 a
  left join FDG_Team5 b
  on a.Team = b.T_Team and a.Season = b.T_Season
  
  "
)  %>% select(-T_Team,-T_Season,-T_Age,-T_G,-T_AB)# Unncessary Team Variables


nrow(df_pitching_init) - nrow(pitcher_data6) #check if any rows are duplicated


```

------------------------------------------------------------------------

# Creating Rankings for Players Based On Percentiles {.tabset .tabset-pills}

We can use Percentile based ranking to get rankings for players from the 2021 season.

## Worth of each stat

### Calculating past performance

Each player goes from a 0% to 100% on each percentile stat that is used for creating a scoring opportunity. Data is not normalized by IP as certain stats such as Wins will be worth more when we do.\

```{r}

#Categories I include are:
#Wins, Saves, WHIP, ERA, SOs, Holds

df_pitching_init2 =  df_pitching_init %>%
#  arrange(player_id,year) %>% 
  group_by(Season) %>% 
  mutate(
    Wins_share = order(order(rank(W_IP,ties.method = 'average'),decreasing = FALSE))/n(),
     SO_share = order(order(rank(SO_IP,ties.method = 'average'),decreasing = FALSE))/n(),
     SV_share = order(order(rank(SV_IP,ties.method = 'average'),decreasing = FALSE))/n(),
     WHIP_share = order(order(rank(WHIP,ties.method = 'average'),decreasing = FALSE))/n(),
     ERA_share = order(order(rank(ERA,ties.method = 'average'),decreasing = FALSE))/n(),
    HLD_share = 0,
    Worth = Wins_share+SO_share+SV_share+WHIP_share+ERA_share+HLD_share
    ) %>% 
  ungroup() 
```

Chart of the Distribution of initial percentiles\
As the chart below shows, the data is roughly normal.

```{r}

skewness((df_pitching_init2$Worth))

ggplot2::qplot(df_pitching_init2$Worth, main="Total Pitching Worth Dataset") + geom_histogram(colour="black", fill="steelblue")

min(df_pitching_init2$Worth)

max(df_pitching_init2$Worth)

ggpubr::ggqqplot(df_pitching_init2$Worth)

shapiro.test(df_pitching_init2$Worth)
```

------------------------------------------------------------------------

## 2021 Player Rankings - Per IP performance

### 2021 Player Rankings - Top Worth Players with Holds

Total Rankings for the players (Using 5x5 Scoring) can be found [here.]() While it looks like many of the top players have low worth scores, it is because we haven't applied a modifier for IP yet. Wins are harder to come by relative to any other stat and require more innings pitched.

```{r,warning=FALSE}


df_pitching_init2_raw =  df_pitching_init %>%
#  arrange(player_id,year) %>% 
  group_by(Season) %>% 
  mutate(
    Wins_share_raw = order(order(rank(W,ties.method = 'average'),decreasing = FALSE))/n(),
     SO_share_raw = order(order(rank(SO,ties.method = 'average'),decreasing = FALSE))/n(),
     SV_share_raw = order(order(rank(SV,ties.method = 'average'),decreasing = FALSE))/n(),
     WHIP_share = order(order(rank(WHIP,ties.method = 'average'),decreasing = FALSE))/n(),
     ERA_share = order(order(rank(ERA,ties.method = 'average'),decreasing = FALSE))/n(),
    HLD_share_raw = 0,
    Worth = Wins_share_raw+SO_share_raw+SV_share_raw+WHIP_share+ERA_share+HLD_share_raw
    ) %>% 
  ungroup() %>% 
select(-W,-SO,-SV,-WHIP,-ERA,-HLD)



options(digits=2)

df_pitching_init2021_raw =
df_pitching_init2_raw %>% 
  group_by(Name) %>% 
  filter(Season == 2021) %>% 
  arrange(desc(Worth)) %>% 
  select(Name,Wins_share_raw,SO_share_raw,SV_share_raw,WHIP_share,ERA_share,Worth)


df_pitching_init2021_raw %>%
  filter (Worth>3.5) %>% 
  kbl() %>% 
 kable_material(c("striped", "hover","condensed","responsive"),full_width = F,fixed_thead = T)
```

------------------------------------------------------------------------

## 2021 Player Rankings - Actual Performance

### 2021 Player Rankings - Top Worth Players with Holds

While it looks like many of the top players have low worth scores, it is because we haven't applied a modifier for IP yet.

```{r,warning=FALSE}

options(digits=2)

df_pitching_init2021 =
df_pitching_init2 %>% 
  group_by(Name) %>% 
  filter(Season == 2021) %>% 
  arrange(desc(Worth)) %>% 
  select(Name,Wins_share,SO_share,SV_share,WHIP_share,ERA_share,HLD_share,Worth)


df_pitching_init2021 %>%
  filter (Worth>2.9) %>% 
  kbl() %>% 
 kable_material(c("striped", "hover","condensed","responsive"),full_width = F,fixed_thead = T)
```

------------------------------------------------------------------------

# Creating Model File {.tabset .tabset-pills}  

## Additional Data Prep  

### Remove Variables which are based off current hitting numbers  

Not all variables can be used for predictive modeling.  Variables that go into the percentile ranking or are non-normalized metrics created after the fact (such as `WAR` - Wins above Replacement or `RS` - Raw Run Support) should be removed. However, metrics that are normalized by a per pitch basis (such as `wFB/C`) can remain as we expect pitchers to have similar performance in these metrics one year out.  

```{r}
#Be careful about RS - Run Support and RS/9

#Creating a new dataset to keep original intact
df_pitching_init3 = df_pitching_init2 %>% 
  select (-Name)
```


Lagged Percentile (`_share`) Variables can be used for predictive modeling. However since these variables were created for the Worth metric they must also be removed for modeling purposes.  

```{r}

#Order the dataset by lag columns
df_pitching_init4 =  arrange(df_pitching_init3, playerid,Season) #playerid is the Fangraph id assigned to each player

# Convert dataframe to data.table format
DT_pitcher2 = data.table(df_pitching_init4)

#designate columns to lag - just the new shares
cols1 = (c('Wins_share','SO_share','SV_share', 'ERA_share','WHIP_share','HLD_share','Worth'))
anscols = paste("lag", cols1, sep="_") 
DT_pitcher2[, (anscols) := data.table::shift(.SD, 1, NA, "lag"),by ='playerid', .SDcols=cols1] #Create 1 period lags by year

df_pitching_final = as.data.frame(DT_pitcher2) %>% 
  select(-c(Wins_share,SO_share,SV_share, ERA_share,WHIP_share,HLD_share))%>%
select(-FIP,-(RAR:WPA),-(wFB:wCH),-(`ERA-`:`xFIP-`),
       -SIERA,-(`RA9-WAR`:`Age Rng`),-kwERA,-(`wCH (pi)`:`wSL (pi)`),-(`K/9+`:`HR/FB%+`)) %>% select(-W,-SO,-SV,-HLD,-W_IP,-SO_IP,-SV_IP,-WHIP,-ERA,-HLD_IP)

```

### Creating Training/Test Split

We split the data into Training Data (which is used to create the model) and test data (which is used to validate the model)

```{r}

set.seed(15674)  # For reproducibility
# Create index for testing and training data
inTrain <- createDataPartition(y = df_pitching_final$Worth, p = 0.80, list = FALSE)
# subset pitching data for training
tr_2021 <- df_pitching_final[inTrain,]
# subset the rest to test and validate trained model
te_2021 <- df_pitching_final[-inTrain,]

nrow(tr_2021)/nrow(df_pitching_final) #check if split is 0.8

```

### Treat Missing Data by Imputing Mean Value

Vtreat Package in R is excellent for treating data before using for modeling. Additional documentation can be found [here.](https://winvector.github.io/vtreat/index.html)

```{r}
treat_plan_2021 <- vtreat::designTreatmentsZ(
  dframe = tr_2021, # training data
  varlist = colnames(tr_2021) %>% .[. != "hitting_score1"], # input variables = all training data columns, except random
  codeRestriction = c("clean", "isBAD", "lev"), # derived variables types (drop cat_P)
  verbose = FALSE) # suppress messages

#clean stands for cleaned numerical variable, isBAD indicates that a value replacement has occurred (which indicates a missing value in this case), and lev is a binary indicator whether a particular value of that categorical variable was present.  

#### Checking Scoreframe

score_frame <- treat_plan_2021$scoreFrame %>% 
  select(varName, origName, code)

head(score_frame)


tr_treated_2021 <- vtreat::prepare(treat_plan_2021, tr_2021)
te_treated_2021 <- vtreat::prepare(treat_plan_2021, te_2021)


treat_plan_2021 <- vtreat::designTreatmentsZ(
  dframe = DT_pitcher2, # training data
  varlist = colnames(DT_pitcher2) %>% .[. != "hitting_score1"], # input variables = all training data columns, except random
  codeRestriction = c("clean", "isBAD", "lev"), # derived variables types (drop cat_P)
  verbose = FALSE) # suppress messages


total_treated_2021_pitching <- vtreat::prepare(treat_plan_2021, DT_pitcher2)

#tr_treated = tr
#te_treated = te

dim(tr_treated_2021) #note there are dummies for each player and team

```

------------------------------------------------------------------------

### Check Distribution of Training Population

The population used for Training should be indicative of Total Population

```{r}

ggplot2::qplot(tr_treated_2021$Worth, main="Training Set") + geom_histogram(colour="black", fill="steelblue") + theme_bw()

#The skewness is actually a bit better than the overall data set
skewness(tr_treated_2021$Worth) 


```

------------------------------------------------------------------------

# Running XGboost Model {.tabset .tabset-pills}

To keep things simple with modeling, we'll turn the training data into simple input variables for `caret::train`, dropping the response variable and converting the data frame to a matrix. Documentation for this approach to XGboost can be found [here.](https://www.kaggle.com/pelkoja/visual-xgboost-tuning-with-caret)

## Tuning the Model

### Initial Non-Tuned Model

Break the data set into x and y inputs with x being a matrix. `"_isBAD"` is a category created by the `Vtreat` package in case you want to identify rows

```{r}
input_x <- as.matrix(((tr_treated_2021))%>%
   select(-Worth) %>%                      
   select(!ends_with ("_isBAD")))

input_y <- tr_treated_2021$Worth

```

**XGBoost with Default Hyperparameters:**\
The Variable Importance (`caret::varImp(xgb_base_2021, scale = F`) from the caret package shows the contribution of each variable to the initial model. Since this is untuned, we can expect the percentage imporantance to change as the models iterate through potential hyperparameters.\
*XGBoost documentation can be found for more general models [here.](https://www.kaggle.com/code/rtatman/machine-learning-with-xgboost-in-r/notebook)*

```{r}

#Defaults for xgboost model
grid_default <- expand.grid(
  nrounds = 100,
  max_depth = 6,
  eta = 0.3,
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)

#This is a blank train_control set, this will be updated after
train_control <- caret::trainControl(
  method = "none",
  verboseIter = FALSE, # no training log
  allowParallel = TRUE # FALSE for reproducible results 
)

xgb_base_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = train_control,
  tuneGrid = grid_default,
  method = "xgbTree",
  verbose = TRUE
)

caret::varImp(xgb_base_2021, scale = F  )
```

------------------------------------------------------------------------

## Further Variable Selection

### Remove redundant and highly correlated variables

Selection Removal Step 1: Check for high correlations\
Normally, this step is done early, but those steps were reserved for preparing the data

```{r}

dep_cor1 <- t(as.data.frame(cor(tr_treated_2021[ , colnames(tr_treated_2021) != "Worth"],
                tr_treated_2021$Worth)))
dep_cor1 <-
as.data.frame(t(as.data.frame(dep_cor1)%>% 
  select(!starts_with("lag")) %>% #remove lag variables
  select(!contains("_isBAD")))) 

dep_cor1 <- tibble::rownames_to_column(dep_cor1,"VARIABLES")%>% #remove indicators for missing data
  filter(V1 > 0.40|V1 < -0.3)

dep_cor1

dep_cor2 <- colnames(row_to_names(t(dep_cor1),row_number = 1))
```

Let's Remove variables with high correlation to worth metric, and metrics that are calculated after a player's performance (such as `WPA`/`RE24`) or redundant (`RS_IP`)

```{r}

input_x <- as.matrix(((tr_treated_2021))%>%
   select(-Worth) %>% #Remove some variables variables
     select (-RS_IP,-ER_IP,-R_IP,-REW,-RE24,-Clutch,-WPA_slash_LI,-Season #Remove redundant variables or non/weighted variables
) %>%      
select(!ends_with ("_isBAD"))) #indicator variable for missing data

input_y <- tr_treated_2021$Worth


```

Run the model on the new dataset to make sure the variable importances look fine

```{r}

#Note Training parameters were set in initial model set up
xgb_base_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = train_control,
  tuneGrid = grid_default,
  method = "xgbTree",
  verbose = TRUE
)

caret::varImp(xgb_base_2021, scale = F  )


```

------------------------------------------------------------------------

## Model with new data

### Tuning All Hyperparameters

A tune grid allows us to test a large amount of hyper-parameters and find the model with the lowest RMSE for predictions.\
However, The more values you want to test and the greater the amount of Cross-Fold Validations (`method = "cv"`), the greater the computational time it will take. More information on the specific parameters can be found [here.](https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/beginners-tutorial-on-xgboost-parameter-tuning-r/tutorial/)

```{r}

# maximum number of trees
nrounds <- 1000

# note to start nrounds from 200, as smaller learning rates result in errors so
# big with lower starting points that they'll mess the scales
tune_grid <- expand.grid(
  nrounds = seq(from = 100, to = nrounds, by = 50),
  eta = c(0.01, 0.025, 0.05, 0.075, 0.1),
  max_depth = c(2, 4, 6, 8, 10),
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)

tune_control <- caret::trainControl(
  method = "cv", # cross-validation
  number = 5, # with n folds 
  ## Note this was # out in the original code
  #index = createFolds(tr_treated$Id_clean), # fix the folds
  verboseIter = FALSE, # no training log
  allowParallel = FALSE # FALSE for reproducible results 
)



```

*Running the initial tuning model*

```{r}
#Note I will be timing these runs to give an estimate on how long this model takes to run
start_time <- Sys.time()

xgb_tune_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = tune_control,
  tuneGrid = tune_grid,
  method = "xgbTree",
  verbose = FALSE
  ,verbosity = 0
)

end_time <- Sys.time()

end_time - start_time

```

*Tuning Plot and Variable Importance*

```{r}
varImp(xgb_tune_2021, scale = F  ) 


# helper function for the plots
tuneplot <- function(x, probs = .90) {
  ggplot(x) +
    coord_cartesian(ylim = c(quantile(x$results$RMSE, probs = probs), min(x$results$RMSE))) +
    theme_bw()
}

tuneplot(xgb_tune_2021)
```

------------------------------------------------------------------------

### Fine Tuning Model  

#### Second Tuning: Maximum Depth and Minimum Child Weight  

After fixing the learning rate to the best tune from the previous iteration and we'll also set maximum depth to 3 +-1 (or +2 if max_depth == 2) to experiment a bit around the suggested best tune in previous step. Then, well fix maximum depth and minimum child weight.  

```{r}
tune_grid2 <- expand.grid(
  nrounds = seq(from = 50, to = nrounds, by = 50),
  eta = xgb_tune_2021$bestTune$eta,
  max_depth = ifelse(xgb_tune_2021$bestTune$max_depth == 2,
    c(xgb_tune_2021$bestTune$max_depth:4),
    xgb_tune_2021$bestTune$max_depth - 1:xgb_tune_2021$bestTune$max_depth + 1),
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = c(1, 2, 3),
  subsample = 1
)

xgb_tune2_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = tune_control,
  tuneGrid = tune_grid2,
  method = "xgbTree",
  verbose = TRUE
)

tuneplot(xgb_tune2_2021)

xgb_tune2_2021$bestTune

varImp(xgb_tune2_2021, scale = F  ) 
```

------------------------------------------------------------------------

#### Third Tuning: Column and Row Sampling

```{r}

tune_grid3 <- expand.grid(
  nrounds = seq(from = 50, to = nrounds, by = 50),
  eta = xgb_tune_2021$bestTune$eta,
  max_depth = xgb_tune2_2021$bestTune$max_depth,
  gamma = 0,
  colsample_bytree = c(0.4, 0.6, 0.8, 1.0),
  min_child_weight = xgb_tune2_2021$bestTune$min_child_weight,
  subsample = c(0.5, 0.75, 1.0)
)

xgb_tune3_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = tune_control,
  tuneGrid = tune_grid3,
  method = "xgbTree",
  verbose = TRUE
)

tuneplot(xgb_tune3_2021, probs = .95)

xgb_tune3_2021$bestTune

varImp(xgb_tune3_2021, scale = F  ) 
```

------------------------------------------------------------------------

#### Fourth Tuning: Gamma

Next, we again pick the best values from previous step, and now will see whether changing the gamma has any effect on the model fit:

```{r}
tune_grid4 <- expand.grid(
  nrounds = seq(from = 50, to = nrounds, by = 50),
  eta = xgb_tune_2021$bestTune$eta,
  max_depth = xgb_tune2_2021$bestTune$max_depth,
  gamma = c(0, 0.05,0.1, 0.2,0.4, 0.5, 0.7, 0.9, 1.0),
  colsample_bytree = xgb_tune3_2021$bestTune$colsample_bytree,
  min_child_weight = xgb_tune2_2021$bestTune$min_child_weight,
  subsample = xgb_tune3_2021$bestTune$subsample
)

xgb_tune4_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = tune_control,
  tuneGrid = tune_grid4,
  method = "xgbTree",
  verbose = TRUE
)

tuneplot(xgb_tune4_2021)

xgb_tune4_2021$bestTune

varImp(xgb_tune4_2021, scale = F  ) 
```

------------------------------------------------------------------------

#### Fifth Tuning: Reducing the Learning Rate

Now, we have tuned the hyperparameters and can start reducing the learning rate to get to the final model:

```{r}
start_time <- Sys.time()

tune_grid5 <- expand.grid(
  nrounds = seq(from = 100, to = 10000, by = 75),
   eta = c(0.01, 0.015, 0.025,0.035, 0.05,0.75, 0.1),
  max_depth = xgb_tune2_2021$bestTune$max_depth,
  gamma = xgb_tune4_2021$bestTune$gamma,
  colsample_bytree = xgb_tune3_2021$bestTune$colsample_bytree,
  min_child_weight = xgb_tune2_2021$bestTune$min_child_weight,
  subsample = xgb_tune3_2021$bestTune$subsample
)



xgb_tune5_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = tune_control,
  tuneGrid = tune_grid5,
  method = "xgbTree",
  verbose = TRUE
)

#tuneplot(xgb_tune5_2021)

end_time <- Sys.time()

end_time - start_time

xgb_tune5_2021$bestTune

varImp(xgb_tune5_2021, scale = F  ) 
```

------------------------------------------------------------------------

#### Fitting Final Model

```{r}

(final_grid_2021 <- expand.grid(
  nrounds = xgb_tune5_2021$bestTune$nrounds,
  eta = xgb_tune5_2021$bestTune$eta,
  max_depth = xgb_tune5_2021$bestTune$max_depth,
  gamma = xgb_tune5_2021$bestTune$gamma,
  colsample_bytree = xgb_tune5_2021$bestTune$colsample_bytree,
  min_child_weight = xgb_tune5_2021$bestTune$min_child_weight,
  subsample = xgb_tune5_2021$bestTune$subsample
))

(xgb_model_2021 <- caret::train(
  x = input_x,
  y = input_y,
  trControl = train_control,
  tuneGrid = final_grid_2021,
  method = "xgbTree",
  verbose = TRUE
))

varImp(xgb_model_2021, scale = F  ) 
```

------------------------------------------------------------------------

## Model Performance

### Checking Model on Test Split Data

We don't need to look too closely at are training data as Xgboost will heavily overfit the model based on that data. The more important part is how the model performs on in predicting our Test Sample that was not included.

```{r}


y_pred_test <- predict(xgb_model_2021, data.matrix(te_treated_2021))

test_stats= cbind((te_treated_2021$Worth),y_pred_test)

test_statsR2 = cor(test_stats[,1],test_stats[,2])^2

print(test_statsR2)


y_pred_train <- predict(xgb_model_2021, data.matrix(tr_treated_2021))

train_stats = cbind((tr_treated_2021$Worth),y_pred_train)

train_statsR2 = cor(train_stats[,1],train_stats[,2])^2

print(train_statsR2)

#test dataset
x <- select(te_treated_2021, -Worth)
y <- (te_treated_2021$Worth)

(xgb_model_rmse <- ModelMetrics::rmse(y, predict(xgb_model_2021, newdata = x)))

holdout_x <- select(tr_treated_2021, -Worth)
holdout_y <- tr_treated_2021$Worth

(xgb_model_rmse <- ModelMetrics::rmse(holdout_y, predict(xgb_model_2021, newdata = holdout_x)))


```

#### Graphical Representation of Model

```{r}

ggplot2::ggplot() +
  aes(x = test_stats[,1], y = test_stats[,2]) +
  geom_jitter() +
  xlab("Predicted Values") +
  ylab("Actual Values") +
  ggtitle("Results of Pitching Model on Test Data")+
  theme(plot.title = element_text(hjust = 0.5,size = 22,color ="steel blue"))+
  geom_smooth(method = "lm")

```

------------------------------------------------------------------------

# Creating 2022 Projections from Model {.tabset .tabset-pills}

## Re-fit model for Important Variables

Now that we have an acceptable model, we can use it to create projections for how well we think players should do in 2022 based on their hitting statistics in 2021. First let's reduce

Step 1: Only keep variables with high enough importance in model

```{r}


vip(xgb_model_2021, num_features = 30)  # 10 is the default, 30 gives a visual on the top 30 most important features of the model

unscalevi = vi(xgb_model_2021, method="model") #shows the numbers behind the plot

unscalevi$Importance_perc = with(unscalevi,Importance/sum(Importance)) #adds percentages 

unscalevi # importance by variables

variables_to_keep_2021 = subset(unscalevi, Importance_perc > 0.0010) %>% select(Variable) #Keep Variables that explain at least a small amount [0.1%] of the model. This is a low threshold for inclusion ,but you can adjust this

variables_to_keep_2021b = t(variables_to_keep_2021)

variables_to_keep_2022 = colnames(row_to_names(variables_to_keep_2021b,row_number = 1))

tr_treated_2022 = tr_treated_2021 %>%  select(Worth,one_of(variables_to_keep_2022),starts_with("Team_lev_x_")) #keep modeled important variables along with team indicator variables

te_treated_2022 = te_treated_2021 %>%  select(Worth,one_of(variables_to_keep_2022),starts_with("Team_lev_x_"))

input_x_2022 = as.matrix(select(tr_treated_2022, -Worth))

input_y_2022 = tr_treated_2022$Worth



```

------------------------------------------------------------------------

Step 2: Re-fit model with reduced variable scope\
Note from the best tune below the `nrounds` - is the max I set above and `eta` is at the lowest possible value. This could cause potential overfitting issues, but from our Actual vs. Predicted Graph we know this not to be the case.

```{r}


(final_grid_2021 <- expand.grid(
  nrounds = xgb_tune5_2021$bestTune$nrounds,
  eta = xgb_tune5_2021$bestTune$eta,
  max_depth = xgb_tune5_2021$bestTune$max_depth,
  gamma = xgb_tune5_2021$bestTune$gamma,
  colsample_bytree = xgb_tune5_2021$bestTune$colsample_bytree,
  min_child_weight = xgb_tune5_2021$bestTune$min_child_weight,
  subsample = xgb_tune5_2021$bestTune$subsample
))

(xgb_model_2022 <- caret::train(
  x = input_x_2022,
  y = input_y_2022,
  trControl = train_control,
  tuneGrid = final_grid_2021,
  method = "xgbTree",
  verbose = TRUE
))


vip(xgb_model_2022, num_features = 30)

unscalevi24 = vi(xgb_model_2022, method="model")

unscalevi24$Importance_perc = with(unscalevi24,Importance/sum(Importance)) 

unscalevi24

# Save work for later prediction

save(xgb_model_2022,file = '2022_Pitching5x5_Model.Rdata')

pitching5x5 = xgb_model_2022

pitchinginput = input_x_2022

```

------------------------------------------------------------------------

## Get 2022 list of players

### Arrange the Data so the Columns are in the exact order as the model

First let's prepare a file for predicting based on our model object

```{r}


variableslag5x= row_to_names(as.data.frame(t(variables_to_keep_2022)),row_number = 1)  %>% select (starts_with("lag"))

variables_nolag5x = (owmr::remove_prefix(variableslag5x,"lag" , sep = "_"))

Data_Predict_2022a5x = total_treated_2021_pitching %>% select (one_of(colnames(variables_nolag5x)),Season,playerid)

colnames(Data_Predict_2022a5x) <- paste0("lag_", colnames(Data_Predict_2022a5x))

Data_Predict_2022b5x = total_treated_2021_pitching %>% select (one_of(colnames(variables_nolag5x)))
colnames(Data_Predict_2022b5x) = colnames(variableslag5x)

variables_to_keep_2022_nolag5x = total_treated_2021_pitching %>% select(one_of(variables_to_keep_2022),Season,playerid,starts_with("Team_lev_x_"))%>% select(-one_of(colnames(Data_Predict_2022b5x)))


Data_predict_20225x = sqldf(
  "
  select a.*,b.* from
  Data_Predict_2022a5x a,
  variables_to_keep_2022_nolag5x b
  on b.playerid = a.lag_playerid
  and b.Season = a.lag_Season
  "
) %>% select(-lag_playerid,lag_Season) %>%
  filter(Season == 2021) %>% 
  select(one_of(variables_to_keep_2022),starts_with("Team_lev_x_"))



```

------------------------------------------------------------------------

## Create Predictions for Model

### Run Projections on Players who Played in 2021

This is the raw prediction score per IP for each pitcher

```{r}

pitching_predictions5x = as.data.frame(predict(xgb_model_2022,Data_predict_20225x))

names(pitching_predictions5x) = c("Predict_Score")

Data_predict_2022_w_Pitching_Predictions5x = cbind(Data_predict_2022,pitching_predictions5x) %>% select(playerid,Predict_Score)

head(Data_predict_2022_w_Pitching_Predictions5x)

```

------------------------------------------------------------------------

### Load in Latest 2022 Projections for Innings Pitched

Downloaded from FanGraphs [here.](https://www.fangraphs.com/projections.aspx?pos=all&stats=pit&type=atc&team=0&lg=all&players=0)

```{r}
Latest_2022_pitchingdata_FP = read_csv("FanGraph_Fantasy_Baseball_Pitching.csv")

Latest_2022_pitchingdata_FP

```

------------------------------------------------------------------------

As you can see from the chart below there aren't many elite pitchers in the 87+ Predict score range.  

```{r, warning = False}


Pitching_Data_NonAdj_Projections5x = sqldf(
  "
  select a.*,b.Predict_Score
  from Latest_2022_pitchingdata_FP a 
  left join 
  Data_predict_2022_w_Pitching_Predictions5x b
  on a.playerid = b.playerid
  "
) %>% filter(ADP<370 | is.na(Predict_Score)==F)


Pitching_Data_Adj_Projections5x =
Pitching_Data_NonAdj_Projections5x %>% 
  mutate(
    Avg_IP = 60,
    AdjPredict_Score_raw = ifelse(is.na(Predict_Score),NA,Predict_Score*(IP/Avg_IP)),
    max_predscore= max(AdjPredict_Score_raw,na.rm = T),
    AdjPredict_Score = ifelse (is.na(AdjPredict_Score_raw),NA,AdjPredict_Score_raw *100/max_predscore),
    WAR_rank = order(order(rank(WAR,ties.method = 'average'),decreasing = TRUE)),
    AdjPredict_Score_Rank = order(order(rank(AdjPredict_Score,ties.method = 'average'),decreasing = TRUE))-sum(is.na(AdjPredict_Score)),
        Ranks_Above_ADP = ADP - AdjPredict_Score_Rank
  ) %>% select (Name,ADP,WAR, WAR_rank,AdjPredict_Score ,AdjPredict_Score_Rank,Ranks_Above_ADP)


  

ggplot2::qplot(Pitching_Data_Adj_Projections5x$AdjPredict_Score, main="Predictions") + geom_histogram(colour="black", fill="grey") + theme_bw()


```

------------------------------------------------------------------------

# 2022 Projections Full

## Table of Pitching Projections (Players who Didn't Play in 2021 - Recieve an NA)

AdjPredict_Score are normalized to 100

```{r}

tableexport =
Pitching_Data_Adj_Projections5x %>%
  arrange (ADP,WAR) %>% 
  kbl() %>% 
 kable_material(c("striped", "hover","condensed","responsive"),full_width = F,fixed_thead = T)

save_kable(tableexport,file = "Pitching5x5.html")

#tableexport



```

This is a better formatted Table

```{r}



ft_dt <- Pitching_Data_Adj_Projections5x[1:nrow(Pitching_Data_Adj_Projections5x), 1:ncol(Pitching_Data_Adj_Projections5x)] %>% 
  filter(AdjPredict_Score_Rank>0)%>%  arrange((AdjPredict_Score_Rank))

ft_dt$ADP <- color_tile("white", "red")(ft_dt$ADP)

ft_dt$WAR <- color_bar("lightblue")(ft_dt$WAR)

ft_dt$AdjPredict_Score<- color_bar("lightblue")(ft_dt$AdjPredict_Score)

ft_dt$WAR_Rank <- color_tile("green","orange")(ft_dt$WAR_rank)

ft_dt$Predict_Rank <- color_tile("green","orange")(ft_dt$AdjPredict_Score_Rank) 


ft_dt$Ranks_Above_ADP <- 
  ifelse(
  ft_dt$Ranks_Above_ADP < 0,
  cell_spec(round(ft_dt$Ranks_Above_ADP,2), color = "red", italic = T),
  cell_spec(round(ft_dt$Ranks_Above_ADP,2), color = "green", italic = T)
)


ft_dt2 <- ft_dt[c("Name", "ADP", "WAR", "AdjPredict_Score", "WAR_Rank","Predict_Rank","Ranks_Above_ADP")]



table_export = 
kbl(ft_dt2, escape = F) %>% 
 kable_material(c("striped", "hover","condensed","responsive"),full_width = F,fixed_thead = T) %>%   column_spec(6, width = "3cm") %>%
  add_header_above(c(" ", "Scores" = 3, "Ranks" = 2," "))
save_kable(table_export,file = "Pitching5x5_updated.html")
  
table_export  






```

------------------------------------------------------------------------



</html>
